(57) Abstract

A method and system for detecting coincidences in a data set of objects, where each object
has a number of attributes. Iteratively, equally-sized subsets of the data set are sampled,
and coincidences (co-occurrences of a plurality of attribute values in one or more objects in
the subset) are recorded. For each coincidence of interest, the expected coincidence count is
determined and compared with the observed coincidence count; this comparison is used to
determine a measure of correlation for the plurality of attributes for the coincidence. The
resulting set of k-tuples of correlated attributes is reported, a k-tuple of correlated
attributes being a plurality of attributes for which the measure of correlation is above a
predetermined threshold. The method and system (implemented on an array of processing nodes)
are suitable for protein structure analysis, e.g. in HIV research.
COINCIDENCE DETECTION METHOD,
PRODUCTS AND APPARATUS 

TECHNICAL FIELD 

The invention relates to methods, devices and systems for coincidence detection 
among a multitude of variables. In addition, the invention relates to applying coincidence 
detection methods to various fields, and to products derived from such application. 

BACKGROUND ART 
k-tuples of Correlated Attributes

The discovery of correlations among pairs or k-tuples of variables has applications in 
many areas of science, medicine, industry and commerce. For example, it is of great interest 
to physicians and public health professionals to know which lifestyle, dietary, and 
environmental factors correlate with each other and with particular diseases in a database of 
patient histories. It is potentially profitable for a trader in stocks or commodities to discover 
a set of financial instruments whose prices covary over time. Sales staff in a supermarket 
chain or mail-order distributor would be interested in knowing that consumers who buy 
product A also tend to buy products B and C, and this can be discovered in a database of 
sales records. Computational molecular biologists and drug discovery researchers would 
like to infer aspects of 3D molecular structure from correlations between distant sequence 
elements in aligned sets of RNA or protein sequences. 

One formulation of the general problem, which encompasses many diverse applications and
which facilitates understanding of the principles described herein, is a matrix of discrete
features in which the rows correspond to "objects" (such as individual patients, stock
prices, consumers, or protein sequences) and the columns correspond to features, or
attributes, or variables (such as lifestyle factors, stocks, sales items, or amino acid residue
positions).

Mathematical methods for determining a measure of the type, degree, and statistical 
significance of correlation between any two, or even three or four, particular variables are 
widespread and well-understood. These methods include linear and nonlinear regression for 
continuous variables and contingency table analysis techniques for discrete variables.
However, great difficulties arise when one tries to estimate correlation - or just estimate 
joint or conditional probabilities - over much larger sets of variables. This intractability has 
one main cause - there are too many joint attribute-value probability density terms - and this 
manifests itself in two serious problems: (1) computing and storing frequency counts over all 
terms, over the database, requires too much computation and memory; (2) there is usually an 
insufficient number of database records to support reliable probability estimates based on 
those frequency counts. 

Let us consider some details. For M records (objects), N variables (attributes,
fields), and supposing that each variable has the same set of $|A|$ possible values,
there are $\binom{N}{k} = \frac{N!}{(N-k)!\,k!}$ k-tuples of columns. Adding the number of
k-tuples for each k = 1, 2, ..., N results in $2^N - 1$ such tuples of all sizes. This
exponential complexity has been a major obstacle standing in the way of higher-order
probability estimation and correlation detection methodologies.
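
The total over all tuple sizes follows from the binomial theorem. As a brief worked step (a standard identity, not specific to this disclosure):

```latex
\sum_{k=1}^{N} \binom{N}{k} \;=\; (1+1)^N - \binom{N}{0} \;=\; 2^N - 1
```

For example, N = 10 columns already give 1023 non-empty subsets of columns.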

One natural way to think about this complexity is in terms of the power set of the set
of column variables. This power set forms a mathematical lattice under the operation $\subset$, a
"tower" corresponding to a graph whose nodes are subsets of this set of column variables.
(Note that if a set has N members, the power set has $2^N$ members). From this viewpoint,
two nodes representing subsets $\sigma_1$ and $\sigma_2$ are connected if and only if either
$\sigma_1 \subset \sigma_2$ or $\sigma_2 \subset \sigma_1$. We say that $\sigma_2$'s node is
above $\sigma_1$'s if $\sigma_1 \subset \sigma_2$. This gives a natural meaning to the term
"higher-order", as appearing higher up the tower. We call the bottom, the null set node, the
0th tier; the single column terms form the first tier, and so on.

Continuing with the tower analogy, we note that each "floor" of this edifice contains
$\binom{N}{k}$ "suites", and each suite contains $|A|^k$ "rooms". In other words, the kth level
of the lattice corresponds to $\binom{N}{k}$ different k-tuples of column variables, and
associated with each k-tuple is an ($|A|$ by ... by $|A|$) contingency table, each cell of which
must store the counted frequency of a particular joint symbol
$(a_{i_1}, a_{i_2}, \ldots, a_{i_k})$ were one to use a classical contingency table test for
the correlation between those particular k columns. (See Figure 1).

For any $k \in \{1, 2, \ldots, N\}$, for any particular k-tuple of columns
$(c_{j_1}, c_{j_2}, \ldots, c_{j_k})$, there are $|A|^k$ possible joint values. For any such
k-tuple of columns, the estimation of Kullback divergence or other correlation function
using the dataset is at least an $\Omega(Mk)$ or $\Omega(|A|^k)$ computation, depending upon
the relative sizes of M, k and $|A|$.

A comprehensive probabilistic model of the database must be able to specify probability
estimates for $\sum_{k=1}^{N} \binom{N}{k} |A|^k$ terms. This means, for example in the
computational molecular biology domain, that for a tiny heptapeptide sequence family, each
sequence having a length of seven amino acid residues, there are 1,801,088,540 terms to
specify. For an unrealistically small RNA of fifteen nucleotides in length, over the smaller
RNA alphabet of four base symbols, there are 30,517,578,124 terms.
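
The two term counts quoted above can be reproduced directly from the sum. The following is a small illustrative check, not part of the disclosed method; the function names are hypothetical, and the closed form $(|A|+1)^N - 1$ is simply the binomial theorem applied to the same sum.

```python
from math import comb


def model_terms(n_columns: int, alphabet_size: int) -> int:
    """Sum over k = 1..N of C(N, k) * |A|^k: one term per joint value of each k-tuple."""
    return sum(
        comb(n_columns, k) * alphabet_size ** k
        for k in range(1, n_columns + 1)
    )


def model_terms_closed_form(n_columns: int, alphabet_size: int) -> int:
    """Same quantity via the binomial theorem: (|A| + 1)^N - 1."""
    return (alphabet_size + 1) ** n_columns - 1


if __name__ == "__main__":
    print(model_terms(7, 20))   # heptapeptide, 20 amino acids: 1801088540
    print(model_terms(15, 4))   # 15-nucleotide RNA, 4 bases: 30517578124
```

Both printed values match the figures given in the text.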

Clearly the models can become intractably huge. What about the space of possible
models through which a modelling/learning procedure must search? Consider a latent-variable
model, which seeks to explain correlations between sets of observable variables by
positing latent variables whose states influence the observables jointly. Since each model
must specify a set of k-tuples of variables, and there are exp(2, $2^N$) (i.e., 2 to the power $2^N$)
such sets, there are exp(2, $2^N$) possible models in the worst-case search space.

Various methods for determining a measure of higher-order probabilities will
circumvent the combinatorial explosion through severe prior restrictions on the width k (See
Figure 3), the locality (Figure 2), the number, or the degrees of correlation of the
higher-order features sought, and on the kinds of models entertained (See Figure 4).
Three Goals of Probability Estimation

It is useful, before discussing details of existing methods and of the current invention, 
to delineate three different possible goals of probability estimation in large datasets, each 
corresponding to a large body of research and current practice: 

1. Estimation of the fully-specified, fully higher-order joint probability
distribution: Estimate a probability density q that specifies a joint probability
for all k-tuples of attributes and possible values.

2. Hypothesis testing, for particular hypotheses concerning particular attributes
and particular variables: For example, are the data consistent with the hypothesis
that columns $c_{j_1}, c_{j_2}, \ldots, c_{j_k}$ are independent?

3. Feature detection, or "data mining": Detect the most suspicious coincidences,
for example, joint attribute occurrences that are more probable than would be
predicted from lower-order marginals. Related to this, find the most highly
correlated k-tuples of columns.

It is the feature detection and data mining applications that are most relevant to the
present invention. However, some of the most successful ways to estimate a full higher-order
joint probability distribution of a database require the specification of exactly those
higher-order terms which represent high correlations among sets of k>2 variables and
invoking maximum entropy assumptions, and therefore the current invention is aimed at
those applications as well.

Related Work 

Various mathematical and computational methods have been proposed and used to 
estimate higher-order probabilities, to detect correlations, and to model higher-order 
database relationships. All such prior methods either perform a global, sometimes
exhaustive search through all possible k-tuples of variables, which is too costly, or they
avoid the complexity altogether by limiting their search to only k-tuples of a specific fixed,
small size k. (Often, k = 2 so only pairwise correlations are ever considered).

WO 98/43182 PCT/CA98/00273



Below are listed some representative examples of related work. 



Assuming Independence between Attributes. The easiest way to avoid the
complexity of higher-order correlations is just to pretend that they do not exist. Many of the
algorithms and computer programs, historically dominant in some fields of application of the
current method, simply construct and use a model of the data in which all variables, all
attributes, are independent. For example, the modelling of DNA and protein sequences, in
computational molecular biology, is often done with consensus sequences and profiles,
which assume incorrectly that the different base or amino acid residue positions are
independent. Reliance on such models can obscure crucial functional and structural insights
into the DNA or proteins being modelled.



Prior Limits on k. One proposal for Gibbs models of databases is based on the
use of Gibbs potentials, and it proposes a hashing method for calculating these special terms.
Each kth-order potential requires an estimation of a kth-order joint probability density as
well as some number of lower-order (typically (k-1)th-order) densities. The asymptotic time
complexity of Miller's pattern-collection subroutine, the major component of the potential
calculation, is, when interpreted in our terminology:

$$M \cdot \sum_{k=1}^{K} \binom{N}{k} 2^k \approx O(M N^K)$$

where $K = k_{max}$ is the highest order of features for which one will search and by which one
will represent database objects. This exponential blow-up prevents one from searching for
higher-order features (HOFs) of any order k much higher than 4 or 5 in databases with
hundreds of attributes.

Many methods, in different application areas, simply limit k to k = 2. For example,
pairwise inter-residue correlation methods discover second-order features that can be useful
in the prediction of protein structure and function and that can be built into classifiers more
sensitive than first-order sequence classifiers and fold-recognizers. To the extent that k-ary
interactions are important, and to the extent that such interactions leave traces in sets of
homologous sequences, the pairwise methods are deficient. One can try to infer k-ary
correlations from sets of 2-ary correlations [9] (essentially by computing the transitive
closure of the "CorrelatesWith" binary relation), but this heuristic can lead to trouble: high
pairwise correlations among variables x, y, z do not in general imply, nor are they necessarily
implied by, a high 3-ary correlation (as measured by Kullback divergence) of the three
variables x, y, z. In other application areas, such as the study of multiple drug interactions, it
is similarly true that important higher-order relationships can be missed by pairwise
correlation detection methods.
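
A standard textbook illustration of the second half of this point is the exclusive-or relation: if x and y are independent fair bits and z = x XOR y, every pair of columns looks independent while the triple is fully dependent. The sketch below assumes that the Kullback divergence is taken between the empirical joint distribution and the product of the single-column marginals; the text does not spell out the reference distribution, so that choice and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def kullback_divergence(rows, cols):
    """D(joint || product of single-column marginals), in bits, over the given columns."""
    m = len(rows)
    joint = Counter(tuple(row[c] for c in cols) for row in rows)
    marginals = [Counter(row[c] for row in rows) for c in cols]
    total = 0.0
    for values, count in joint.items():
        p = count / m                      # empirical joint probability
        q = 1.0                            # probability under independence
        for marginal, value in zip(marginals, values):
            q *= marginal[value] / m
        total += p * log2(p / q)
    return total


# Columns: x, y, z = x XOR y, with every (x, y) combination equally frequent.
rows = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

if __name__ == "__main__":
    for pair in combinations(range(3), 2):
        print(pair, kullback_divergence(rows, pair))   # 0.0 for every pair
    print((0, 1, 2), kullback_divergence(rows, (0, 1, 2)))  # 1.0 bit
```

Here a purely pairwise search would report nothing, although the three columns together are as strongly related as three binary columns with uniform marginals can be.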

The Paturi et al Method for Identifying the Most Correlated Pair of Random
Variables. A method has been reported for the problem of finding the most highly
correlated pair $X_i, X_j$ of variables from among a large set of N random binary variables
$X_1, X_2, \ldots, X_N$. The method is easily extended to finding the most correlated k-tuple
of random binary variables, but at a significant increase in computational complexity, and
only for $k \ge 2$ fixed a priori. It uses a definition of correlation that has
Correlation$(X_i, X_j) = P[X_i = X_j]$ over some set of M samples
$\{X^m_1, X^m_2, \ldots, X^m_N\}_{m = 1, \ldots, M}$. (Here $P[X_i = X_j]$ means "the
probability that variable $X_i$ has the same value, or state, as variable $X_j$".) Much of the
computational complexity, both time complexity and sample complexity, of their method can
be incurred in trying to separate two or more nearly equally-correlated pairs (or k-tuples) of
variables.
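
To make the agreement-based definition of correlation concrete, the following sketch estimates $P[X_i = X_j]$ from samples and picks the best pair by exhaustive comparison. This is only the naive quadratic baseline, not the Paturi et al procedure itself; the function names and the toy samples are hypothetical.

```python
from itertools import combinations


def agreement(samples, i, j):
    """Empirical estimate of P[X_i = X_j] over the given samples."""
    return sum(1 for s in samples if s[i] == s[j]) / len(samples)


def most_correlated_pair(samples):
    """Exhaustive search over all pairs of variables: O(M * N^2) comparisons."""
    n = len(samples[0])
    return max(combinations(range(n), 2),
               key=lambda pair: agreement(samples, *pair))


# Four samples of three binary variables; X_0 and X_1 always agree.
samples = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]

if __name__ == "__main__":
    print(most_correlated_pair(samples))   # (0, 1)
    print(agreement(samples, 0, 2))        # 0.5
```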

The two variants of the Paturi method are asymptotically quadratic and sub-quadratic
in N, respectively, the faster procedure requiring more sampling. When the
method is extended to search for the biggest k-ary correlation, where correlation is now
defined as $P[X_{j_1} = X_{j_2} = \ldots = X_{j_k}]$, the time complexity grows to approximately
$O(M N^k \log N)$. Search for highly correlated attribute cliques of width k much greater than 5
or 6 in very large datasets is once again ruled out.

Hidden Markov Models. Hidden Markov Models (HMMs) have been used 
widely and with increasing success in recent years, in both automatic speech recognition and 
in the modelling of protein, DNA, and RNA sequences. 
Although some groups have reported significant success in modelling protein
sequence families and continuous speech data with HMMs, nonetheless there are great
improvements to be made in learning time and model robustness by the "hardwiring" of
pre-selected higher-order features into HMMs. (This has been investigated for HMM-like
recurrent neural networks, in different domains).

Some of the same reasons why HMMs are very good at aligning the protein
sequences or recorded utterances in the first place, using local sequential correlations, make
such methods less useful for finding the important sequence-distant correlations in data that
has already been partially or completely aligned. The phenomenon responsible for this
dilemma is termed "diffusion".

A first-order HMM, by definition, assumes independence among sequence columns, 
given a hidden state sequence. Multiple alternative state sequences can in principle be used 
to capture longer-range interactions, but the number of these grows exponentially with the 
number of k-tuples of correlated columns. 

The Agrawal et al Method for Discovery of Association Rules. This method
was developed in perhaps the purest data mining context, the automatic extraction of
knowledge-base rules from databases. It considers a database of M transactions (objects,
rows) and N items (attributes, columns) and seeks to extract rules of the form $a \Rightarrow b$. It
therefore seeks pairs of attributes a, b such that "transactions that contain a tend to contain
b", hence those pairs with high values for $p(b \mid a)$. "People who buy CD players tend to buy
CDs" is just one example suggesting the potential commercial interest in such methods.
(More generally, one can search for sets of attributes with high
$p(b_1, b_2, \ldots, b_k \mid a_1, a_2, \ldots)$.)

A rule $a \Rightarrow b$ is said to have:

1. confidence c if c% of transactions containing a also contain b (hence, roughly, if
$\frac{p(a, b)}{p(a)} > \frac{c}{100}$),

2. support s if s% of transactions contain a and b (hence, roughly, if
$p(a, b) > \frac{s}{100}$).
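
As a small illustration of these two definitions (not of the Agrawal et al search procedure itself), the sketch below computes support and confidence, both as percentages, over a toy list of transactions. The transactions and function names are made up for the example.

```python
def support(transactions, items):
    """Percentage of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return 100.0 * hits / len(transactions)


def confidence(transactions, a, b):
    """Percentage of the transactions containing `a` that also contain `b`."""
    with_a = [t for t in transactions if a in t]
    if not with_a:
        return 0.0
    return 100.0 * sum(1 for t in with_a if b in t) / len(with_a)


# Hypothetical sales records for the rule "cd_player => cd".
transactions = [
    {"cd_player", "cd"},
    {"cd_player", "cd"},
    {"cd_player"},
    {"cd"},
    {"book"},
]

if __name__ == "__main__":
    print(support(transactions, ["cd_player", "cd"]))    # 40.0
    print(confidence(transactions, "cd_player", "cd"))   # about 66.7
```

The rule holds in two of the three transactions that contain a CD player, but those two transactions are only 40% of the database, which is the distinction between confidence and support.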
The goals behind this method are different from the objectives of the current
invention. However, the different objectives are brought closer together if one focuses on
the Agrawal method's discovery of symmetric rules (so that the search is for attribute pairs
displaying high values for both $p(b \mid a)$ and $p(a \mid b)$), and if one reduces the
emphasis on support (so that coincidences that are suspicious, even if occurring rarely, are
sought).

The Agrawal method is shown to have $O(\|S\| \cdot MN)$ time complexity, where $\|S\|$ is
the sum of all values Support(a) for an exponentially large number of k-tuples a of
attributes, of any size $1 \le k \le N$, that reach a particular stage of processing in this procedure.
Hence the method is $O(2^N)$ in the worst case. A series of empirical tests was performed on
what the authors considered to be realistic datasets for their domain. The running time of the
procedure grew only linearly with the number M of transactions, but the number of items, or
attributes, was held constant at $N_A$ = 1000, and their constructed datasets probably contained
no correlated k-tuples of width k > 10. An analysis of their algorithm, which is based on an
incremental build-up of kth-order cliques from (k-1)th-order cliques, makes clear that the
method takes much more computation to find wide HOFs (large k) than narrower HOFs
(lower k) of equivalent statistical significance.

Steeg, Robinson, Deerfield, Lappa - 1993. Some rough, heuristic methods have
been presented for finding k-tuples of correlated residues (positions) in sets of aligned
protein sequences. One of the presented methods employed one embodiment of a
rudimentary version of the representation and detecting coincidences steps of the method
described herein.

Alternative methods of, and devices for, finding correlations between attributes, and 
applications for those correlations, are required. 
DISCLOSURE OF THE INVENTION

In a first aspect the present invention provides a coincidence detection method for use with a 
data set of objects having a number of attributes. The base method includes the following 
steps: 

•   representing a set of M objects in terms of a number $N_A$ of variables
    ("attributes"), where an attribute is said to occur in an object if the object
    possesses the attribute;
•   sampling a subset of r out of the M objects, for each iteration among a
    predetermined number of iterations;
•   detecting and recording coincidences among sets of k of the attributes in each
    sampled subset of objects, a coincidence being the co-occurrence of
    $1 \le k \le N_A$ attributes in the same h out of r objects in the sampled subset,
    where $0 \le h \le r$;
•   determining an expected count of coincidences for any set of k attributes and
    a predetermined number of iterations of sampling and coincidence-counting
    as described above, the determining being performed before sampling and
    collecting, at the same time or after sampling and collecting;
•   comparing, for any set of k attributes and number of iterations of sampling
    and coincidence-counting, the observed count versus the expected count of
    coincidences, and from this comparison determining a measure of correlation
    (or association, or dependence) for the set of k attributes; and
•   reporting a set of k-tuples of correlated attributes, where a k-tuple of
    correlated attributes is a set of k of the $N_A$ attributes which have been
    determined by this process to have a value for a chosen correlation measure
    above a predetermined threshold value.

In a second aspect the invention provides a coincidence detection method for use with a data 
set of objects having a number of attributes, the method comprising the steps of: 

•   sampling a subset of the data set for a predetermined number of iterations,
    each iteration the sampled subset of the data set having for each object the
    same subset of attributes;
•   detecting, and recording counts of, coincidences in each sampled subset of
    the data set, a coincidence being the co-occurrence of a plurality of attribute
    values in one or more objects in a sampled subset of the data set, where the
    plurality of attribute values is the same for each occurrence, the detecting and
    recording counts of coincidences in each sampled subset of the data set being
    performed before, at the same time or after sampling, detecting and recording
    counts of coincidences in other subsets;
•   determining an expected count for each coincidence of interest, the
    determining being performed before, at the same time, or after sampling,
    detecting and recording;
•   comparing, for each coincidence of interest, the observed count of
    coincidences versus the expected count of coincidences, and from this
    comparison determining a measure of correlation for the plurality of
    attributes for the coincidence; and
•   reporting a set of k-tuples of correlated attributes, where a k-tuple of
    correlated attributes is a plurality of attributes for which the measure of
    correlation is above a respective pre-determined threshold.

In any of its aspects the comparison of observed and expected counts may be calculated
using a Chernoff bound on tail probabilities, and counts may be recorded by storing a
running total of the count of each coincidence over all of the sampled subsets.
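
One way such a comparison could be scored is sketched below. It assumes, purely for illustration, that the observed count behaves like a sum of independent 0/1 trials whose mean is the expected count, and it uses the standard multiplicative Chernoff upper-tail bound $P[X \ge (1+\delta)\mu] \le \left(e^{\delta} / (1+\delta)^{1+\delta}\right)^{\mu}$. The source does not state which form of the bound is intended, and the function name is hypothetical.

```python
from math import exp, log


def chernoff_upper_tail(expected: float, observed: float) -> float:
    """Upper bound on P[X >= observed] when E[X] = expected (independent 0/1 trials).

    With delta = observed / expected - 1, the bound (e^delta / (1+delta)^(1+delta))^mu
    simplifies to exp(observed - expected - observed * ln(observed / expected)).
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    if observed <= expected:
        return 1.0  # the bound is only informative above the mean
    return exp(observed - expected - observed * log(observed / expected))


if __name__ == "__main__":
    # A coincidence expected 10 times but observed 30 times is very surprising.
    print(chernoff_upper_tail(10.0, 30.0))
    # One observed about as often as expected is not.
    print(chernoff_upper_tail(10.0, 10.0))   # 1.0
```

A small bound means the excess of observed over expected coincidences is unlikely under the independence assumption, so the reciprocal or negative logarithm of the bound could serve as the measure that is compared against the threshold.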

In a third aspect the invention provides a method for visual exploration of a data set of 
objects having a number of attributes, the method comprising the steps of: 

•   sampling a subset of the data set for a predetermined number of iterations,
    each iteration the sampled subset of the data set having the same number of
    objects although not necessarily the same objects and having for each object
    the same subset of attributes;
•   detecting, and recording counts of, coincidences in each sampled subset of
    the data set, a coincidence being the co-occurrence of a plurality of attribute
    values in one or more objects in a sampled subset of the data set, where the
    plurality of attribute values is the same for each occurrence, the detecting and
    recording counts of coincidences in each sampled subset of the data set being
    performed before, at the same time or after sampling, detecting and recording
    counts of coincidences in other subsets;
•   determining an expected count for each coincidence of interest, the
    determining being performed before, at the same time, or after sampling,
    detecting and recording;
•   comparing, for each coincidence of interest, the observed count of
    coincidences versus the expected count of coincidences, and from this
    comparison determining a measure of correlation for the plurality of
    attributes for the coincidence; and
•   reporting a set of k-tuples of correlated attributes to a user through a
    graphical interface, where a k-tuple of correlated attributes is a plurality of
    attributes for which the measure of correlation is above a respective
    pre-determined threshold.

In a fourth aspect the invention provides a pre-processing method for use with a data
modelling unit to capture and report to the data modelling unit higher order interactions of a
data set of objects having a number of attributes, the method comprising the steps of:

•   sampling a subset of the data set for a predetermined number of iterations,
    each iteration the sampled subset of the data set having for each object the
    same subset of attributes;
•   detecting, and recording counts of, coincidences in each sampled subset of
    the data set, a coincidence being the co-occurrence of a plurality of attribute
    values in one or more objects in a sampled subset, where the plurality of
    attribute values is the same for each occurrence, the detecting and recording
    counts of coincidences in each sampled subset being performed before, at the
    same time or after sampling, detecting and recording counts of coincidences
    in other subsets;
•   determining an expected count for each coincidence of interest, the
    determining being performed before, at the same time, or after sampling,
    detecting and recording;
•   comparing, for each coincidence of interest, the observed count of
    coincidences versus the expected count of coincidences, and from this
    comparison determining a measure of correlation for the plurality of
    attributes for the coincidence; and
•   reporting to the data modelling unit a set of k-tuples of correlated attributes,
    where a k-tuple of correlated attributes is a plurality of attributes for which
    the measure of correlation is above a respective pre-determined threshold.

In a fifth aspect the invention provides a correlation elimination method for use with a data 
set of objects having a number of attributes, the method comprising the steps of: 

•   sampling a subset of the data set for a predetermined number of iterations,
    each iteration the sampled subset of the data set having for each object the
    same subset of attributes;
•   detecting, and recording counts of, coincidences in each sampled subset of
    the data set, a coincidence being the co-occurrence of a plurality of attribute
    values in one or more objects in a sampled subset of the data set, where the
    plurality of attribute values is the same for each occurrence, the detecting and
    recording counts of coincidences in each sampled subset being performed
    before, at the same time or after sampling, detecting and recording counts of
    coincidences in other subsets;
•   determining an expected count for each coincidence of interest, the
    determining being performed before, at the same time, or after sampling,
    detecting and recording;
•   comparing, for each coincidence of interest, the observed count of
    coincidences versus the expected count of coincidences, and from this
    comparison determining a measure of correlation for the plurality of
    attributes for the coincidence; and
•   eliminating a set of k-tuples of correlated attributes, where a k-tuple of
    correlated attributes is a plurality of attributes for which the measure of
    correlation is above a respective pre-determined threshold.
In any of the aspects, the objects may be sales transactions, each transaction comprising one
or more purchased products, and the attributes may be instances of sale of particular
products or types of products. The objects may be time slices and the attributes may be the
status of elements in a system. The objects may be time slices and the attributes may be
prices of, or price changes of, financial instruments or commodities.

In any of the aspects the steps of the method may be represented by the following
pseudo-code:

    0.  begin
    1.    read(MATRIX);
    2.    read(R, T);
    3.    compute_first_order_marginals(MATRIX);
    4.    csets := {};
    5.    for iter = 1 to T do
    6.      sampled_rows := rsample(R, MATRIX);
    7.      attributes := get_attributes(sampled_rows);
    8.      all_coincidences := find_all_coincidences(attributes);
    9.      for coincidence in all_coincidences do
    10.       if cset_already_exists(coincidence, csets)
    11.         then update_cset(coincidence, csets);
    12.         else add_new_cset(coincidence, csets);
    13.       endif
    14.     endfor
    15.   endfor
    16.   for cset in csets do
    17.     expected := compute_expected_match_count(cset);
    18.     observed := get_observed_match_count(cset);
    19.     stats := update_stats(cset, hypoth_test(expected, observed));
    20.   endfor
    21.   print_final_stats(csets, stats);
    22. end
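
The pseudo-code leaves the helper routines unspecified. The following Python sketch is one simplified reading of it, offered only as an illustration under several assumptions that are not taken from the text: a coincidence is taken to be the set of (column, value) attributes shared by all R sampled rows (the h = r case), rows are sampled with replacement, the expected count is derived from first-order marginals under independence, and a plain observed/expected ratio stands in for the hypothesis test. All function names are hypothetical.

```python
import random
from collections import Counter
from math import prod


def first_order_marginals(matrix):
    """Relative frequency of each value in each column."""
    m = len(matrix)
    return [
        {value: count / m for value, count in Counter(row[j] for row in matrix).items()}
        for j in range(len(matrix[0]))
    ]


def find_coincidence(rows):
    """Attributes (column, value) on which every sampled row agrees."""
    return frozenset(
        (j, v) for j, v in enumerate(rows[0]) if all(r[j] == v for r in rows)
    )


def detect_coincidences(matrix, r, t, seed=0):
    """Sample r rows t times, record coincidence sets, compare observed vs expected."""
    rng = random.Random(seed)
    marginals = first_order_marginals(matrix)
    csets = Counter()                       # running totals over all iterations
    for _ in range(t):
        sampled = rng.choices(matrix, k=r)  # assumption: sampling with replacement
        cset = find_coincidence(sampled)
        if len(cset) >= 2:                  # only k >= 2 attributes are of interest
            csets[cset] += 1
    results = []
    for cset in csets:
        # Under independence, all r rows show value v in column j with prob. f_j(v)^r.
        expected = t * prod(marginals[j][v] ** r for j, v in cset)
        # A cset is also matched whenever a recorded superset of it was matched.
        observed = sum(n for other, n in csets.items() if cset <= other)
        results.append((sorted(cset), observed, expected))
    results.sort(key=lambda item: item[1] / item[2], reverse=True)
    return results


if __name__ == "__main__":
    # Columns 0 and 1 are identical; column 2 is independent of both.
    data = [(b, b, c) for b in (0, 1) for c in (0, 1)] * 25
    for attrs, observed, expected in detect_coincidences(data, r=2, t=2000, seed=1):
        print(attrs, observed, round(expected, 1))
```

In this toy data set the pair of columns 0 and 1 is matched far more often than independence predicts, which is the kind of excess the comparison step is meant to expose.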
In a sixth aspect the invention provides a coincidence detection system for use with a data
set of objects, each object having a plurality of attributes, the system comprising:

•   means for sampling a subset of the data set for a predetermined number of
    iterations, each iteration the sampled subset of the data set having for each
    object the same subset of attributes;
•   means for detecting, and recording counts of, coincidences in each sampled
    subset of the data set, a coincidence being the co-occurrence of a plurality of
    attribute values in one or more objects in a sampled subset of the data set,
    where the plurality of attribute values is the same for each occurrence, the
    detecting and recording counts of coincidences in each sampled subset being
    performed before, at the same time or after sampling, detecting and recording
    counts of coincidences in other subsets;
•   means for determining an expected count for each coincidence of interest, the
    determining being performed before, at the same time, or after sampling,
    detecting and recording;
•   means for comparing, for each coincidence of interest, the observed count of
    coincidences versus the expected count of coincidences, and from this
    comparison determining a measure of correlation for the plurality of
    attributes for the coincidence; and
•   means for reporting a set of k-tuples of correlated attributes, where a k-tuple
    of correlated attributes is a plurality of attributes for which the measure of
    correlation is above a respective pre-determined threshold.



In the system of the sixth aspect, the means for sampling a subset of the data set may
comprise means for dividing the data set into subsets for sampling. The means for detecting
and recording counts of coincidences may comprise an array of processing nodes, each
processing node detecting and recording a respective subcount of coincidences, and the
means for comparing, for each coincidence of interest, said observed count of coincidences
to said expected count of coincidences may comprise means for merging said subcounts to
provide said observed count. At least one of said processing nodes may comprise a
respective subarray of processing nodes that detect and record respective subsubcounts of
coincidences, and said means for merging merges said subsubcounts to provide said
subcounts and/or said observed count. Each processing node may comprise memory
including an input buffer for storing received subsets of the data set and an output buffer for
storing the subcount or the subsubcount; and a memory bus that transfers data to and from
the memory.
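
The subcount-and-merge arrangement can be pictured in software as follows. This is a sequential simulation for illustration only: each "node" is an ordinary function call rather than separate hardware, a coincidence is again assumed to be the set of (column, value) attributes shared by all rows of a sampled subset, and every name in the sketch is hypothetical.

```python
from collections import Counter


def node_subcount(sampled_subsets):
    """Work of one processing node: count coincidence sets over its share of subsets."""
    subcount = Counter()
    for rows in sampled_subsets:
        shared = frozenset(
            (j, v) for j, v in enumerate(rows[0]) if all(r[j] == v for r in rows)
        )
        if len(shared) >= 2:
            subcount[shared] += 1
    return subcount


def split_among_nodes(items, n_nodes):
    """Deal the sampled subsets out to n_nodes nodes in round-robin fashion."""
    return [items[i::n_nodes] for i in range(n_nodes)]


def merge_subcounts(subcounts):
    """Merging step: add the per-node subcounts into one observed count."""
    total = Counter()
    for subcount in subcounts:
        total.update(subcount)
    return total


# Four sampled subsets of two rows each, over three columns.
subsets = [
    [(0, 0, 1), (0, 0, 0)],
    [(1, 1, 1), (1, 1, 0)],
    [(0, 0, 1), (0, 0, 1)],
    [(0, 1, 0), (1, 0, 1)],
]

if __name__ == "__main__":
    per_node = [node_subcount(chunk) for chunk in split_among_nodes(subsets, 3)]
    print(merge_subcounts(per_node) == node_subcount(subsets))   # True
```

Because the counts are simple sums, the merge gives the same totals however the subsets are divided, which is also why a node can itself delegate to a subarray and merge subsubcounts in the same way.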


In a seventh aspect the invention provides coincidence detection programmed media for use 
with a computer and with a data set of objects having a number of attributes, the 
programmed media comprising: 

a computer program stored on storage media compatible with the computer,
the computer program containing instructions to direct the computer to:

•   sample a subset of the data set for a predetermined number of
    iterations, each iteration the sampled subset of the data set having for
    each object the same subset of attributes;
•   detect and record counts of coincidences in each sampled subset of
    the data set, a coincidence being the co-occurrence of a plurality of
    attribute values in one or more objects in a sampled subset of the data
    set, where the plurality of attribute values is the same for each
    occurrence, the detecting and recording counts of coincidences in
    each sampled subset being performed before, at the same time or after
    sampling, detecting and recording counts of coincidences in other
    subsets;
•   determine an expected count for each coincidence of interest, the
    determining being performed before, at the same time, or after
    sampling, detecting and recording;
•   compare, for each coincidence of interest, the observed count of
    coincidences versus the expected count of coincidences, and from this
    comparison determine a measure of correlation for the plurality of
    attributes for the coincidence; and
•   report a set of k-tuples of correlated attributes, where a k-tuple of
    correlated attributes is a plurality of attributes for which the measure
    of correlation is above a respective pre-determined threshold.
In an eighth aspect the invention provides a coincidence detection system for use with a data
set of objects having a number of attributes, the system comprising:

a computer; and

a computer program on media compatible with the computer, the computer program
directing the computer to:

• sample a subset of the data set for a predetermined number of iterations, each
iteration the sampled subset having for each object the same subset of
attributes,

• detect, and record counts of, coincidences in each sampled subset of the data
set, a coincidence being the co-occurrence of a plurality of attribute values in
one or more objects in a sampled subset of the data set, where the plurality of
attribute values is the same for each occurrence, the detecting and recording
counts of coincidences in each sampled subset being performed before, at the
same time or after sampling, detecting and recording counts of coincidences
in other subsets;

• determine an expected count for each coincidence of interest, the determining
being performed before, at the same time, or after sampling, detecting and
recording,

• compare, for each coincidence of interest, the observed count of coincidences
versus the expected count of coincidences, and from this comparison
determine a measure of correlation for the plurality of attributes for the
coincidence, and

• report a set of k-tuples of correlated attributes, where a k-tuple of correlated
attributes is a plurality of attributes for which the measure of correlation is
above a respective pre-determined threshold.

In any of its aspects the methods of the invention may further comprise the step of
representing the objects and attributes in a matrix of objects versus attributes prior to
sampling the data set, the data set being sampled by sampling the matrix.

In a ninth aspect the invention provides a product having a set of attributes selected by:

sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset having 
the same number of objects although not necessarily the same objects and 
having for each object the same subset of attributes, 
detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets, 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording, 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and 

reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

In a tenth aspect the invention provides a product defined by applying a set of rules 
generated from: 

• sampling a subset of a data set representing objects versus attributes for a
predetermined number of iterations, each iteration the sampled subset having
for each object the same subset of attributes,

• detecting and recording counts of coincidences in each sampled subset of the
data set, a coincidence being the co-occurrence of a plurality of attribute
values in one or more objects in a sampled subset of the data set, where the
plurality of attribute values is the same for each occurrence, the detecting and
recording counts of coincidences in each sampled subset being performed
before, at the same time or after sampling, detecting and recording counts of
coincidences in other subsets,

• determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording,

• comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of
attributes for the coincidence, and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of
correlation is above a respective pre-determined threshold.



In any aspect the methods of the invention may further comprise the step of applying rules
that are defined by the reported correlated attributes.



In an eleventh aspect the invention provides a peptide or peptidomimetic including a 
structural motif of the V3 loop of HIV envelope protein including spatial coordinates of 
residues A18/Q31/H33.



In a twelfth aspect the invention provides a pharmaceutical composition comprising a ligand
that interacts with a protein having a structural motif identified using the method of claim 2,
and a pharmaceutically acceptable carrier or excipient therefor. The ligand may comprise
chemical moieties of suitable identity and spatially located relative to each other so that the
moieties interact with corresponding residues or portions of the motif. The ligand, by
interacting with the motif, may interfere with function of a region of the protein comprising
the motif.



In a thirteenth aspect the invention provides a diagnostic agent comprising a ligand that 
interacts with a protein having a structural motif identified using the method of the earlier 
aspects of the invention, and a detectable label linked to the ligand. 

In a fourteenth aspect the invention provides a pharmaceutical composition for interacting
with an envelope protein of human immunodeficiency virus (HIV), the envelope protein
including a structural motif of the V3 loop having spatial coordinates of residues
A18/Q31/H33, comprising a ligand including at least one functional group that interacts with
the motif, and a pharmaceutically acceptable carrier or excipient therefor. The ligand may
include at least one functional group capable of binding to and being present in an effective
position in said ligand to bind to residue 18, at least one functional group capable of binding
to and being present in an effective position in said ligand to bind to residue 31, and at least
one functional group capable of binding to and being present in an effective position in said
ligand to bind to residue 33.

In a fifteenth aspect the invention provides a method of designing a ligand to interact with a
structural motif of an envelope protein of human immunodeficiency virus (HIV), the method
comprising the steps of: providing a template having spatial coordinates of residues A18,
Q31 and H33 in the V3 loop of HIV envelope protein, and computationally evolving a
chemical ligand using an effective algorithm with spatial constraints, so that said evolved
ligand includes at least one effective functional group that binds to the motif. The ligand
may comprise at least one functional group capable of binding to and being present in an
effective position in said ligand to bind to residue 18, at least one functional group capable
of binding to and being present in an effective position in said ligand to bind to residue 31,
and at least one functional group capable of binding to and being present in an effective
position in said ligand to bind to residue 33.

In a sixteenth aspect the invention provides a method of identifying a ligand to bind with a
structural motif of an envelope protein of human immunodeficiency virus (HIV), the method
comprising the steps of: providing a template having spatial coordinates of A18, Q31 and
H33 in the V3 loop of HIV envelope protein; providing a database containing structure and
orientation of molecules; and screening said molecules to determine if they contain effective
moieties spaced relative to each other so that the moieties interact with the motif. A first
moiety of the molecule may interact with residue 18, a second moiety of the molecule
may interact with residue 31 and a third moiety of the molecule may interact with residue 33.

In a seventeenth aspect the invention may provide antigens and vaccines
embodying the covarying k-tuples described herein.

In an eighteenth aspect the invention provides a product being defined by its interaction with 
a set of attributes selected by: 

• sampling a subset of a data set representing objects versus attributes for a
predetermined number of iterations, each iteration the sampled subset of the
data set having the same number of objects although not necessarily the same
objects and having for each object the same subset of attributes,

• detecting, and recording counts of, coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of attribute
values in one or more objects in a sampled subset, where the plurality of
attribute values is the same for each occurrence, the detecting and recording
counts of coincidences in each sampled subset being performed before, at the
same time or after sampling, detecting and recording counts of coincidences
in other subsets,

• determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording,

• comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of
attributes for the coincidence, and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of
correlation is above a pre-determined threshold.



In any of the aspects the objects may be compounds and the attributes may comprise
particular chemical moieties. The objects may be peptides or proteins and the attributes may
comprise particular structural or substructural patterns or motifs. The objects may be
selected from the group consisting of compounds, molecular structures, nucleotide
sequences and amino acid sequences and the attributes may be features of the selected
objects. The objects may be time slices and the attributes may be biological parameters of
genes or gene products. The objects may be documents that are electronically stored and/or
electronically indexed and the attributes may be topics. The objects may be customers and
the attributes may comprise products purchased or not purchased by those customers. The
attributes may further comprise mailings made or not made to the customers. The objects
may comprise products and the attributes may comprise customers that have or have not
purchased those products. The attributes may further comprise demographic variables of the
customers. The objects may be people with a particular disease or disorder and the
attributes may be potential contributing factors for the disease or disorder. The objects may
be people with a number of different diseases or disorders and the attributes may be potential
contributing factors for the diseases or disorders. The objects may comprise factors
potentially contributing to a disease or disorder and the attributes may be people with or
without those factors, in which case the method associates groups of people of substantially
equivalent risk for the disease or disorder.

WO 98/43182    PCT/CA98/00273

The objects may be time slices and the attributes may comprise the state of components in a
system at time slices prior to failure of the system, in which case the method associates 
component states that may potentially cause failure of the system. 

In the first aspect $r_i$ may be the same for every iteration.

In any of the aspects the method provided may further comprise the steps of first creating a
database of transitions between system states, wherein a system state is represented by a
value of a state variable, over a chosen time quantum, and presenting the database, in whole
or part, as a data set such that each state to state transition set corresponds to one of M
objects and so that each state variable corresponds to an attribute.

In any of its aspects the method provided may further comprise the steps of first creating a
database of states and actions covering a chosen time quantum and presenting the database,
in whole or part, as a data set such that each state/action/state triple corresponds to one of
M objects and so that each state variable or action type corresponds to an attribute.
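
The presentation of state/action/state triples as objects can be sketched as follows. This is only an illustration under assumptions that are not part of the specification: the function name `triples_as_objects`, the dictionary representation of a state, and the `@before`/`@after` attribute labels are all hypothetical.

```python
def triples_as_objects(states, actions):
    """Return one attribute set per state/action/state triple.

    states  -- list of dicts mapping a state-variable name to its value,
               one dict per time quantum (hypothetical representation)
    actions -- list of action labels; actions[t] leads from states[t]
               to states[t + 1]
    """
    objects = []
    for t, action in enumerate(actions):
        before, after = states[t], states[t + 1]
        # Each state variable (before and after) becomes an attribute.
        attrs = {f"{var}={val}@before" for var, val in before.items()}
        # The action type becomes an attribute as well.
        attrs.add(f"action={action}")
        attrs |= {f"{var}={val}@after" for var, val in after.items()}
        objects.append(frozenset(attrs))
    return objects
```

Each returned set would then be one of the M objects (one row of the matrix) handed to the sampling and coincidence-counting steps.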

In a nineteenth aspect the invention provides a coincidence detection method for use with a
data set of objects having a number of attributes represented in a matrix of objects versus
attributes, the method comprising the steps of:

• sampling a subset of the matrix for a predetermined number of iterations,
each iteration the sampled subset of the matrix having for each object the
same subset of attributes;

• detecting, and recording counts of, coincidences in each sampled subset of
the matrix, a coincidence being the co-occurrence of a plurality of attribute
values in one or more objects in a sampled subset of the matrix, where the
plurality of attribute values is the same for each occurrence, the detecting and
recording counts of coincidences in each sampled subset being performed
before, at the same time or after sampling, detecting and recording counts of
coincidences in other subsets;

• determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording;

• comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of
attributes for the coincidence; and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of
correlation is above a respective pre-determined threshold.

In the first aspect numerical correlation values may be reported along with the set of k-tuples
of correlated attributes.

BRIEF DESCRIPTION OF DRAWINGS

For a better understanding of the present invention and to show more clearly how it
may be carried into effect, reference will now be made, by way of example, to the
accompanying drawings which show the preferred embodiment of the present invention and
in which:

Figure 1 is a depiction of a power set of a set with N=6 objects, arranged as a lattice
under a subset operation, representing all possible k-tuples of columns from the power set.

Figure 1a is a depiction of the relative portions of all lattice nodes shown (dark
squares) or omitted (light squares) by Figure 1.

Figure 2 is a depiction of n-grams for all sizes n = 1, 2, ..., 6 for the power set of
Figure 1.

Figure 2a is a depiction of the relative portion of all lattice nodes shown or omitted
in Figure 2 with a subset of the terms highlighted.

Figure 3 is a depiction of all possible pairwise correlations for the power set of
Figure 1, corresponding to analysis of the third tier up from the bottom of the lattice. This is
a shortcut taken in work on inter-residue correlations in protein and RNA sequence families,
for example. In another example, this Figure represents the approach taken by a method
that simply finds all pairs of sales items that tend to be purchased together by consumers.

Figure 3a illustrates the relevant correlations from Figure 3 out of the power set of
Figure 1.

Figure 4 is a depiction of a partition of the variables of the objects of the power set
of Figure 1. A partition is one particular and important kind of componential model of a
sequence family or other aligned dataset. In a componential model, a set of $N_y$ latent
variables $y_i$ is found to "generate" or "explain" a larger set of N observable variables $c_j$. In a
partition model, $N_y \le N$, each $c_j$ is generated by exactly one of the $y_i$, and typically $N_y < N$.
The observables corresponding to one latent variable form a kind of clique, and presumably
are highly correlated with each other and relatively uncorrelated with variables outside the
clique. In Figure 4, the observables are formed into three cliques: $(c_1)$, $(c_2, c_5, c_6)$, and
$(c_3, c_4)$.

Figure 4a illustrates the partition of Figure 4 out of the power set of Figure 1.

Figure 5 is a depiction of three iterations of sampling of a dataset in accordance with
one embodiment of the invention.

Figure 5a is a depiction of the three iterations of sampling of Figure 5 with
explanatory notes.

Figure 6 is a general flow diagram of a program method of a preferred embodiment.

Figure 7 is a schematic diagram of a system implementing the program method of
Figure 6.

Figure 8 is a general flow diagram of the program method of Figure 6 adapted to
control a process for production of a product.

Figure 9 is a schematic diagram of a system implementing the adapted program
method of Figure 8.

Figure 10 is a general flow diagram of the program method of Figure 6 adapted to
generate rules for a rules based system that in turn produces a product.

Figure 11 is a schematic diagram of a system implementing the adapted program
method of Figure 10.

Figure 12 is a general flow diagram of the program method of Figure 6 adapted to
generate rules used to control a process for production of a product.

Figure 13 is a schematic diagram of a system implementing the adapted program
method of Figure 12.

Figure 14 is a diagram of a node of a hardware implementation of a preferred
embodiment.

Figure 15 is a diagram of residues for given sequences for the sample 3D structure of
Figure 15a where coincidence of sequences may indicate conserved physical or structural
relationships.

Figure 15a is a diagram of a 3D structure for a sample protein.

Figure 16 is a diagram of steps in tertiary structure prediction which can employ the
methods described herein.

MODES FOR CARRYING OUT THE INVENTION 

As previously set out, a base method described herein employs the steps of: 

• representing a set of M objects in terms of a number $N_A$ of variables
("attributes"), where an attribute is said to occur in an object if the object
possesses the attribute;

• sampling a subset of $r_i$ out of the M objects, for each iteration among a
predetermined number of iterations;

• detecting and recording coincidences among sets of k of the attributes in each
sampled subset of objects, a coincidence being the co-occurrence of
$1 \le k \le N_A$ attributes in the same $h_i$ out of $r_i$ objects in the sampled subset,
where $0 \le h_i \le r_i$;

• determining an expected count of coincidences for any set of k attributes and
a predetermined number of iterations of sampling and coincidence-counting
as described above, the determining being performed before sampling and
collecting, at the same time or after sampling and collecting;

• comparing, for any set of k attributes and number of iterations of sampling
and coincidence-counting, the observed count versus the expected count of
coincidences, and from this comparison determining a measure of correlation
(or association, or dependence) for the set of k attributes; and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a set of k of the $N_A$ attributes which have been
determined by this process to have a value for a chosen correlation measure
above a predetermined threshold value.
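
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the listing of Appendix "A", and it rests on several assumptions: objects are drawn uniformly with replacement, an attribute counts as coincident only if it occurs in all r sampled objects (that is, h = r), the correlation measure is the ratio of observed to expected counts, and the names `detect_coincidences` and `min_ratio` are hypothetical.

```python
import random
from collections import Counter
from math import prod


def detect_coincidences(objects, r=2, iterations=2000, min_ratio=2.0, seed=0):
    """Report attribute sets that coincide more often than independence predicts.

    objects -- iterable of attribute collections, one per object (row)
    """
    objects = [frozenset(o) for o in objects]
    rng = random.Random(seed)
    m = len(objects)

    # Marginal frequency of each attribute over the whole data set.
    freq = Counter(a for obj in objects for a in obj)
    p = {a: n / m for a, n in freq.items()}

    # Sampling and coincidence recording: keep the set of attributes
    # shared by all r sampled objects.
    csets = Counter()
    for _ in range(iterations):
        sample = [rng.choice(objects) for _ in range(r)]
        shared = frozenset.intersection(*sample)
        if len(shared) >= 2:
            csets[shared] += 1

    report = []
    for cset in csets:
        # Observed count: iterations whose shared set contains this k-tuple.
        observed = sum(n for other, n in csets.items() if cset <= other)
        # Expected count if the attributes were independent: every attribute
        # must occur in each of the r independent draws.
        expected = iterations * prod(p[a] ** r for a in cset)
        ratio = observed / expected
        if ratio > min_ratio:
            report.append((sorted(cset), observed, expected, ratio))
    return sorted(report, key=lambda item: -item[3])
```

Note that the sketch never enumerates candidate k-tuples in advance: the width k of each reported tuple is whatever the sampled objects happen to share.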

An alternative base method can include the following steps: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 
reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

The modes described herein provide extensions to the base methods described above 
and employ similar principles. The principles of one application as described herein may be 
applied to the others as appropriate. Thus, the description of all elements of an application 
will not always be repeated for each application. 

In the preferred embodiment it is preferred for simplicity of programming and 
interpretation to use a matrix where the objects are rows and the attributes are columns; 
however, this is not strictly required and any of the embodiments can utilize a data set of 
objects and attributes that are not represented in the form of a matrix by sampling subsets of 
the data set directly. As known to persons skilled in the art, any relational database can be 
easily transformed into a 2-dimensional matrix format. 
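
As a rough illustration of that transformation, the sketch below pivots a list of (object, variable, value) records into a matrix of objects versus variables. The record layout and the function name `records_to_matrix` are assumptions made for the example; a real relational schema would need its own mapping.

```python
def records_to_matrix(records):
    """Pivot (object_id, variable, value) records into a 2-dimensional matrix.

    Returns (object_ids, variable_names, matrix) where matrix[i][j] is the
    value of variable j for object i, or None if no record supplies it.
    """
    records = list(records)
    object_ids = sorted({obj for obj, _, _ in records})
    variables = sorted({var for _, var, _ in records})
    cell = {(obj, var): val for obj, var, val in records}
    matrix = [[cell.get((obj, var)) for var in variables] for obj in object_ids]
    return object_ids, variables, matrix
```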

The embodiments described herein lend themselves particularly well to parallel
processing as the steps of detecting, recording and counting coincidences for each of the T
samples can be performed simultaneously across many different samples or other subsets of
the data set.
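
The following sketch shows one way the per-sample work could be divided among workers and the resulting subcounts merged. It is an assumption-laden illustration only: Python threads stand in for processing nodes (and give no real speed-up for this CPU-bound loop), coincidences are simplified to attributes shared by all r sampled objects, and the names `count_node` and `parallel_counts` are hypothetical.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_node(objects, r, iterations, seed):
    """One node: sample r objects per iteration and record a subcount."""
    rng = random.Random(seed)
    subcount = Counter()
    for _ in range(iterations):
        sample = [rng.choice(objects) for _ in range(r)]
        shared = frozenset.intersection(*sample)
        if len(shared) >= 2:
            subcount[shared] += 1
    return subcount


def parallel_counts(objects, r, iterations_per_node, nodes=4):
    """Run several nodes concurrently and merge their subcounts."""
    objects = [frozenset(o) for o in objects]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        futures = [
            pool.submit(count_node, objects, r, iterations_per_node, seed)
            for seed in range(nodes)
        ]
        observed = Counter()
        for future in futures:
            observed.update(future.result())  # merge subcounts by addition
    return observed
```

Because each node only adds to its own counter, no coordination is needed until the final merge, which is what makes the counting step easy to distribute.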

Each of the features or variables describing an object may be numerical or 
qualitative. If qualitative, a feature or variable described in terms of some number z of levels 
or qualities may be transformed into a numerical variable with z possible values or states. A 
numerical variable with z possible values or states may be transformed into z binary 
variables, termed attributes. A numerical variable or feature with a continuous range of 
possible values or levels may be transformed into, or represented by, a variable with z 
possible values or states and therefore may also be transformed into, or represented by a set 
of z binary attributes. 
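
A short sketch of this chain of transformations, from a continuous variable to z levels to z binary attributes, is given below. Equal-width binning is an assumption chosen for simplicity (any quantization method could be substituted), and the function names are hypothetical.

```python
def quantize(values, z):
    """Map continuous values onto z equal-width levels numbered 0 .. z-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / z or 1.0  # guard against a constant variable
    return [min(int((v - lo) / width), z - 1) for v in values]


def to_binary_attributes(level, z):
    """Represent one of z levels as z binary attributes, exactly one set."""
    return [1 if i == level else 0 for i in range(z)]
```

For example, a variable quantized to level 1 of 3 becomes the three binary attributes 0, 1, 0.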

More formally, assume that we are given a database of M objects $O_1, O_2, \ldots, O_M$, each
of which is characterized by particular values $a_j \in A_j$ for each of N discrete-valued variables
$v_j$. A particular value for a particular variable is denoted $a_i@v_j$. One may start with
continuously-valued variables and use any of several known methods to quantize them into
discrete variables. We also note that, in many applications, the same alphabet A of possible
values is used for all the variables. Each object might be a particular record in a database, or
may be a sample from a random source.

If the initial N variables are not binary then they can be converted into a set of $N_A$
attributes. For example, in the input listing attached in Appendix "B" each amino acid
position is a variable that has 20 possibilities corresponding to the 20 naturally occurring
amino acids represented by a subset of letters from the alphabet. In order to turn the
variables into binary attributes, each variable becomes 20 different attributes having 1 of 2
states, such as "A" or "not A", "B" or "not B", and so on. An embodiment for representing
variables of this type is included in the source code listing in Appendix "A". Other
techniques for representing data as attributes could be used.
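
One such representation can be sketched as follows; it is not the listing of Appendix "A". Each aligned sequence is turned into the set of `value@position` attributes it possesses, and a 0/1 matrix is then built over the attributes that actually occur. Restricting the columns to observed attributes (rather than all 20 per position) and the function names are assumptions of this example.

```python
def to_attribute_sets(sequences):
    """Turn each aligned sequence into the set of 'value@position' attributes."""
    return [
        frozenset(f"{value}@{pos}" for pos, value in enumerate(seq, start=1))
        for seq in sequences
    ]


def binary_matrix(sequences):
    """Return (attribute names, rows of 0/1) for the observed attributes."""
    attr_sets = to_attribute_sets(sequences)
    names = sorted(set().union(*attr_sets))
    rows = [[1 if name in attrs else 0 for name in names] for attrs in attr_sets]
    return names, rows
```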

The principles set out in this description can also be extended to higher orders of 
attributes, for example trinary attributes to be used with higher order computing machines. 
The binary examples used herein are the simplest to implement. 

This situation can be represented by a table in which each row stands for an object,
each column stands for an attribute, and in which therefore each table entry $a_{ij}$ stands for the
value that the $i$th object has for the $j$th variable. We can also write $c_j$ (for
"column $j$") and an attribute as $a_i@c_j$.

For example, consider this small matrix of six rows (objects) and six columns 
(variables). 



          col1   col2   col3   col4   col5   col6
[6 x 6 table of values; cell contents lost in extraction]

Object number 1 has value 'A' for variable 1, 'B' for variable 2, 'C' for variable 3,
and so on. For some applications, it might be useful to find out that, for example, variables
2 and 4 are correlated. In the toy (small fictional) matrix example above, this correlation
appears plausible, because whenever an object has B@2, it also has D@4; whenever an
object has L@2, it has M@4; and whenever an object has U@2, it also has V@4. Attribute
number 3 does not vary: every object has the attribute C@3, and therefore it does not
correlate in an interesting way with any other variable.

Given a matrix of data, we further assume that there is some "true" underlying
probability distribution $q(\cdot)$ which, for all orders $k = 1, 2, \ldots, N_A$, specifies the probabilities
for each possible k-tuple of attributes. For example, for k = 1, we have $q(c_j) : A_j \to [0, 1]$,
and we might have for some dataset q(B@2) = 0.33. A distribution also specifies higher-order
probabilities, like, for example, q(B@2, F@6) = 0.166. Inherent in the particular
problems posed is the problem of estimating or approximating the distribution $q(\cdot)$, or at
least parts of it.
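
The simplest estimate of such probabilities is the relative frequency of objects possessing the attributes in question, as sketched below. The function name `q_hat` is hypothetical, and the six rows used in the usage note are an invented stand-in chosen only to agree with the facts stated in the text (B@2 in two of six objects, B@2 together with F@6 in one, C@3 in all); they are not the original table.

```python
def q_hat(rows, *attributes):
    """Relative frequency of rows possessing every (value, column) attribute.

    rows       -- sequence of equal-length sequences of values
    attributes -- (value, column) pairs with 1-based column numbers
    """
    hits = sum(
        all(row[col - 1] == value for value, col in attributes) for row in rows
    )
    return hits / len(rows)
```

With the stand-in rows `["ABCDEF", "GBCDHI", "JLCMNO", "PLCMQR", "SUCVWX", "YUCVZK"]`, `q_hat(rows, ("B", 2))` is 2/6 and `q_hat(rows, ("B", 2), ("F", 6))` is 1/6, matching the values 0.33 and 0.166 quoted above.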

The problem is to find some, or all, k-tuples of columns $(c_{j_1}, c_{j_2}, \ldots, c_{j_k})$, for
$k = 2, \ldots, N_A$, whose correlation is greater than some predetermined value. For example, one may
want a procedure which, given an M-by-N table of values, returns a list of k-tuples of
column indices $(j_1, j_2, \ldots, j_k)$ such that
$D\big(q(v_{j_1}, v_{j_2}, \ldots, v_{j_k}) \,\|\, \prod_{i=1}^{k} q(v_{j_i})\big) > \rho_k$ for some real
number $\rho_k$. Here $D(p_1 \| p_2)$ is the Kullback divergence measure, which in this case estimates
the difference between the observed distribution of values over the column variables versus
the distribution wherein all the column variables are statistically independent. The Kullback
measure is just one of many possible measures of correlation or association applicable to this
type of problem.
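
For a small table, the Kullback divergence between the observed joint distribution of a set of columns and the product of their marginals can be computed directly, as in the sketch below. Estimating the distributions by relative frequency, measuring in bits, and the function name `kullback_divergence` are choices made for this illustration.

```python
from collections import Counter
from math import log2


def kullback_divergence(rows, cols):
    """D(joint || product of marginals), in bits, for 0-based column indices."""
    m = len(rows)
    joint = Counter(tuple(row[c] for c in cols) for row in rows)
    marginals = [Counter(row[c] for row in rows) for c in cols]
    divergence = 0.0
    for values, count in joint.items():
        p_joint = count / m
        # Probability of the same values if the columns were independent.
        p_independent = 1.0
        for marginal, value in zip(marginals, values):
            p_independent *= marginal[value] / m
        divergence += p_joint * log2(p_joint / p_independent)
    return divergence
```

The result is zero when the columns are independent in the table and grows as they become more strongly coupled. Computing it for every k-tuple is exactly the exhaustive search that becomes impractical as the number of columns grows.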

For our purposes we consider correlation in terms of deviation from statistical
independence. One can compare an observed number of occurrences of some event in
viewing the database versus the number expected if an underlying hypothesis of independent
variables were true. That is, the problem is: Given the table of values, for all $k = 2, \ldots, N_A$,
return a list of all k-tuples of attributes $(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k})$ such that

$$P\big(\mathrm{Observed}(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k}) \mid \mathrm{Independent}(c_{i_1}, c_{i_2}, \ldots, c_{i_k}), \mathrm{Model}\big) < \theta$$

for some observed behaviour of $(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k})$, for some real number
threshold $\theta \in [0, 1]$, and some Model which underlies one's estimation or hypothesis testing
method.
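
As one concrete possibility for the Model, the number of iterations in which a given coincidence is observed can be treated as binomial, with a per-iteration probability computed under the independence hypothesis. The sketch below evaluates the probability of seeing at least the observed count under that assumption; the binomial choice and the function name `tail_probability` are illustrative assumptions, not the model prescribed by the text.

```python
from math import comb


def tail_probability(observed, trials, p_independent):
    """P(count >= observed) for a Binomial(trials, p_independent) count."""
    return sum(
        comb(trials, i) * p_independent**i * (1 - p_independent) ** (trials - i)
        for i in range(observed, trials + 1)
    )
```

A k-tuple would then be reported when this probability falls below the chosen threshold, meaning the observed count is hard to explain by independent variables alone.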

The sampling subprocess may be random sampling, and if random it may be subject 
to any of a number of possible probability distributions over the objects, including a uniform 
distribution. Similarly, there may be constraints on the statistical independence or 
dependencies between each of the T samples drawn during the operation of the method, and 
between each of the r objects drawn within one sample. 
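
The sampling options just described can be illustrated with the standard library alone. The sketch below draws r objects uniformly without replacement, uniformly with replacement, or according to per-object weights; the function name `draw_sample` and its parameters are hypothetical.

```python
import random


def draw_sample(objects, r, rng, weights=None, replace=False):
    """Draw r objects from the data set.

    weights -- optional per-object weights (non-uniform distribution);
               weighted draws are made with replacement
    replace -- for uniform sampling, whether an object may be drawn twice
    """
    if weights is not None:
        return rng.choices(objects, weights=weights, k=r)
    if replace:
        return rng.choices(objects, k=r)
    return rng.sample(objects, r)  # distinct objects within one sample
```

Whichever scheme is used, the expected coincidence counts must be computed under the same sampling assumptions for the comparison with observed counts to be meaningful.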

Sample Advantages of Preferred Embodiments 

There is at least one class of problems, arising in many diverse application areas, on 
which the comparative advantages of the coincidence detection method and apparatus 
described above and further to be described below are most apparent. Such problems are 
characterized by: 

1. a large number of attributes (columns, in our representation);

2. the possible existence of some number of cliques of highly mutually
correlated attributes in the dataset, each member attribute of each such clique being
relatively uncorrelated with attributes outside its own clique; and

3. lack of prior knowledge as to the precise number, width (k, as in k-ary
correlation and kth-order feature), and location of such attribute cliques.

All other procedures of which we are aware either place prior limitations on the 
width k of discoverable k-tuples, or implement an exhaustive search, serial or parallel, over 
all or nearly all possible k-tuples of attributes. To put it more simply, the method of the 
preferred embodiment takes approximately the same computation time and memory to find a 
44-ary correlation as it takes to find a 2-ary correlation in the same very high dimensional 
dataset. Most prior methods, in contrast, either rule out the discovery of the 44th-order
feature or else require the allocation of orders of magnitude more time or space in order to 
find it. 
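
To give a feel for why exhaustive search scales so poorly, the number of candidate k-tuples among N attributes is the binomial coefficient C(N, k). The helper names below are hypothetical and the snippet is only a counting aid.

```python
from math import comb


def n_ktuples(n_attributes, k):
    """Number of distinct k-tuples that an exhaustive search must examine."""
    return comb(n_attributes, k)


def n_all_ktuples(n_attributes, k_min=2):
    """Number of k-tuples of every width from k_min up to n_attributes."""
    return sum(comb(n_attributes, k) for k in range(k_min, n_attributes + 1))
```

For 1000 attributes there are 499500 pairs to examine, while `n_ktuples(1000, 44)` is larger by many orders of magnitude, which is the gap the sampling approach is intended to avoid.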

Sample Applications of Preferred Embodiments 

Modellers of very large data sets are thwarted in their attempts to compute very far
into a fully higher-order probabilistic model by both the computational complexity of the
task and by the lack of data needed to support statistically significant estimates of most of
the higher-order terms.

The preferred embodiment computes only a subset of higher-order probabilities, and
extracts a limited selection of higher-order features ("HOFs") for construction of a database
model. Efficient use can be made of limited computing resources by pre-selecting sets of
higher-order features using the correlation-detection methods described herein, and building
the most significant (statistically and in terms of application-specific criteria) into
model-based classifiers and predictors based on existing statistical, rule-based, neural network, or
grammar-based methods. The pre-selected sets of HOFs can be used to create rules for such
systems. For example, a data set may be analysed using the methods set out herein to
determine that if a company is filing a patent application then it should file an assignment
from the inventor. This rule is then used in the system to generate assignments whenever it
is determined that a company is filing a patent application. Many rule-based networks could
benefit from pre-processing using the methods described herein; see, for example, the System
and Method for Building a Computer-Based Rete Pattern Matching Network of Grady et al.
described in U.S. Patent Number 5,159,662 issued October 27, 1992; the inference engine
of Highland et al. described in U.S. Patent Number 5,119,470 issued June 2, 1992; and the
Fast Method for a Bidirectional Inference of Masui et al. described in U.S. Patent Number
5,179,632 issued January 12, 1993.

The discovered HOFs can alternatively be used directly to create products, for 
example, in the prediction or determination of protein structure, when fed into existing 
methods based on distance geometry or empirically-estimated patterns of cooperativity and
folding, or in marketing schemes based on correlated product sales information. 

Later below, practice of the principles described herein using the Los Alamos HIV 
Database is described. In particular, the principles were applied to study of the V3 loop of 
envelope proteins of human immunodeficiency virus (HIV). In biochemistry and molecular
biology in general, covariation of particular residues of a protein likely indicates the 
existence of a structural motif characterizing a region of the protein that has a functional, 
physiological role. 

Envelope proteins are partially embedded in the lipid membrane surrounding a virus 
particle, and project externally from the lipid. When the lipid of an HIV particle fuses with
the membrane of a host cell during infection, envelope proteins may also protrude from the 
membrane of the infected cell. The V in V3 stands for "variable", as the sequence of the V3 
loop is highly variable between different virus isolates. 

Previously, a Los Alamos group in B.T.M. Korber, R.M. Farber, D.H. Wolpert and
A.S. Lapedes, "Covariations in the V3 loop of HIV-1: An information-theoretic analysis",
Proc. Nat. Acad. Sci. U.S.A. 90 (1993), the disclosure of which is hereby incorporated
herein by reference, described 2-ary covariation mutations in certain residues of the V3 loop
of HIV-1 envelope proteins. Practice of the present principles has confirmed some of the
Los Alamos group's results, but has further permitted the discovery of other highly
covarying groups of residues. Whereas the Los Alamos group could only discover pairwise
covariation, we describe herein k-ary residue covariation, where k > 2. That is, we have
identified previously unrecognized motifs of HIV envelope protein.

For a particular trial, input consisted of the respective amino acid sequences of V3 
regions from 657 different virus isolates, and is shown in Appendix "B". Source code used 
on the input is shown in Appendices "A" and "D", named "File coinc.pl" and "File
probsort.pl", respectively. Output is shown in Appendix "C". 
Referring to Tables C.1 through C.9 set out elsewhere below, the results of 6
separate trials are shown. Parameter values are as indicated in the respective legends. In each 
Table, the results are ordered by statistical significance, with the most significant correlation 
first, and the standard one-letter amino acid code is employed. Thus, referring to Table C.6, 
the most significant coincidence observed is the occurrence of alanine (A) at residue 18, 
glutamine (Q) at residue 31, and histidine (H) at residue 33. This, like other coincidences set 
forth on the cited pages, represents the identification of a structural motif of the HIV-1 V3
loop which comprises these residues. 

Continuing with the particular example of A18/Q31/H33, the V3 structural motif 
comprising these residues presumably exists on the exterior of the virus particle, and that 
region of the V3 loop likely performs a specific function which requires the particular 
structural motif. Thus, the structural motif would have to be conserved after mutation(s) to 
preserve that function. This reasoning is extended to other coincidences identified herein. 

The identification of a particular conserved structural motif of HIV has several uses. 

Using techniques known in the art, a peptide embodying the motif could be produced 
for use as an antigen. Accordingly, a vaccine could be prepared. The peptide embodying the 
motif might be made using known recombinant methods, as are described generally, for 
example, in Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor 
Laboratory, Cold Spring Harbor, NY (1982) and in Sambrook et al., Molecular Cloning: A 
Laboratory Manual (2nd Edition), Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
(1989). Alternatively, the peptide or a peptidomimetic might be chemically synthesized
using standard chemical techniques. Monoclonal antibodies to the peptide or
peptidomimetic could be generated using standard methods, as described for example, in
Harlow, E. and Lane, D., Antibodies: A Laboratory Manual, Cold Spring Harbor
Laboratory, Cold Spring Harbor, NY (1988). Fragments of such monoclonal antibodies, for
example, Fab fragments, that have specific affinity for the novel structural motif could also be
generated. 
In another embodiment, a ligand that interacts with a structural motif identified
according to the invention could be generated. That is, the ligand would be characterized by
having chemical moieties of suitable identity and spatially located relative to each other so
that the moieties interact with corresponding residues or portions of the motif. In some
embodiments, the ligand could be an agent (e.g., a drug) that, by binding to the motif,
interferes with function of the region. The ligand would therefore be an HIV antagonist
with potential therapeutic utility. Alternatively, the ligand could bind to the particular V3 
region comprising the identified motif, providing diagnostic utility. Such diagnostic utility 
can be ex vivo. A ligand with diagnostic utility (e.g., an antibody) might comprise a label, 
such as a fluor or an enzyme conjugate for use in a colorimetric reaction. Fluorescence- 
labelled viruses or virus-infected cells could be visualized or counted using fluorescence 
microscopy or FACS (fluorescence-activated cell sorting). 

Methods of designing and identifying ligands that bind to structural motifs identified 
according to the invention are also provided by the invention. 

Thus, in one embodiment, the invention provides a ligand for binding with an 
envelope protein of human immunodeficiency virus (HIV), wherein the envelope protein 
includes a structural motif comprising amino acid residues A18/Q31/H33. The ligand
includes at least one functional group capable of binding to the motif. In a preferred
embodiment, the ligand includes at least one functional group capable of binding to and
being present in an effective position in said ligand to bind to residue 18, at least one
functional group capable of binding to and being present in an effective position in said 
ligand to bind to residue 31, and at least one functional group capable of binding to and 
being present in an effective position in said ligand to bind to residue 33. 

In another embodiment, the invention provides a method of designing a ligand to 
bind with a structural motif of an envelope protein of human immunodeficiency virus (HIV). 
The method includes providing a template having spatial coordinates of A18, Q31 and H33
in the V3 loop of HIV-1 envelope protein, and computationally evolving a chemical ligand
using an effective algorithm with spatial constraints, so that said evolved ligand includes at
least one effective functional group that binds to the motif. In a preferred embodiment, the
ligand includes at least one functional group capable of binding to and being present in an
effective position in said ligand to bind to residue 18, at least one functional group capable
of binding to and being present in an effective position in said ligand to bind to residue 31, 
and at least one functional group capable of binding to and being present in an effective 
position in said ligand to bind to residue 33. 

In another embodiment, the invention provides a method of identifying a ligand to 
bind with a structural motif of an envelope protein of human immunodeficiency virus (HIV). 
The method includes: providing a template having spatial coordinates of A18, Q31 and H33 
in the V3 loop of HIV-1 envelope protein; providing a data base containing structure and
orientation of molecules; and screening said molecules to determine if they contain effective 
moieties spaced relative to each other so that the moieties interact with the motif. In a 
preferred embodiment, a first moiety of the molecule interacts with residue 18, a second
moiety of the molecule interacts with residue 31 and a third moiety of the molecule interacts
with residue 33.

The principles described herein encompass similar respective embodiments, including 
antigens and vaccines, for the other covarying k-tuples described herein, that is, both 
residues of the V3 loop that covary, and particular amino acids at certain residues that 
covary. 

The method of the current invention can be viewed as a "high-pass filter" for 
detection of higher-order features. Such HOFs play an important role in database modelling, 
machine learning, and perception and pattern-recognition. In database mining and modelling 
contexts, a procedure for discovery of these features might serve any of several major roles, 
including: 

1. Preprocessing of large, complex datasets: Many of the best modelling
methods, including Gibbs models, Hidden Markov Models and EM, MacKay's
density networks, and related factorial learning methods from the neural network
community, could be helped significantly in capturing higher-order interactions
without exhaustive search or combinatorial explosion of parameter space if preceded
by a fast preprocessing procedure, such as one provided by implementing the
principles described herein, that found plausibly correlated variables in the database.

2. Visual exploration of large complex data sets: If coupled to even a simple
graphical display interface, a procedure such as ours permits a user to view quickly
(with a small number of r-samples) the most plausibly interesting higher-order features
in high-dimensional data.

3. Pre-conditioning and redundancy elimination: Thus far, we have stressed
the utility of finding inter-attribute correlations in order to use them in the building of
models; but in many optimization, learning and data-fitting applications, one requires
that correlations between variables be found and eliminated, through any of a
number of subspace methods like principal components analysis (PCA).



An Embodiment Using a Programmable Digital Computer 
Components for Digital Computer Embodiment 

Data Matrix, Sampling, and Coincidences. Given a set of M objects, each of
which has either a "Yes" (representable by 1) or "No" (representable by 0) value for each of
a fixed set of N attributes, the input dataset can be arranged into an M-by-N table of
values, which we shall call the data matrix or simply matrix, and this matrix, as well as its
sub-matrices and related vectors that comprise functional parts of the system/process
described below, are stored in memory locations within a programmable computer. In this
representation the rows of the matrix correspond to objects, and the columns correspond to
attributes. The matrix may be labelled as $V$ and each element of this two-dimensional table
labelled by $v_{ij} \in \{0, 1\}$, where $i$ refers to the $i$th object (row) $o_i$ and $j$ refers to the $j$th
attribute (column) $a_j$. The set of objects may be listed, for the purposes of this description,
as $O = o_1, o_2, \ldots, o_M$ and the set of attributes may be listed as $A = a_1, a_2, \ldots, a_N$.
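As an illustration of this representation, the following minimal Python sketch stores a small data matrix as a list of rows and reads off one attribute column. The toy values and the helper names `column` and `marginal` are hypothetical and do not come from the appendices; estimating a first-order marginal as the fraction of objects having the attribute is an assumption made here for illustration.

```python
# Hypothetical toy data matrix: M = 4 objects (rows), N = 3 attributes (columns).
matrix = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
]

def column(matrix, j):
    """Return the 0/1 values of attribute j across all objects."""
    return [row[j] for row in matrix]

def marginal(matrix, j):
    """Assumed estimate of p(a_j): fraction of objects for which attribute j is 1."""
    values = column(matrix, j)
    return sum(values) / len(values)

print(column(matrix, 0))
print(marginal(matrix, 0))
```

Each column restricted to a sampled set of rows is the kind of binary vector that the sampling and coincidence steps operate on.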
In one particular embodiment of the invention, the function $f_{match}(a, h, r)$ is obtained
from the multinomial distribution:

$$f_{match}(a, h, r) = \frac{r!}{h!\,(r-h)!}\; p(a_{i_1}, \ldots, a_{i_k})^{h}\; p(\bar{a}_{i_1}, \ldots, \bar{a}_{i_k})^{r-h}$$

This formula gives an estimate of the probability for finding exactly h occurrences of
$a_{i_1}$, h occurrences of $a_{i_2}$, ..., and h occurrences of $a_{i_k}$, all occurring in the same h rows, in
one r-sample.

(This function definition has a simple form because all but two of the large number of
p() factors in the standard multinomial expression vanish with zero exponents.)

The probability of a match of size h for the k attributes which make up a potential
cset has been defined in terms of the joint probability $p(a_{i_1}, \ldots, a_{i_k})$; the Expected Count
Function must employ particular estimates for these joint probabilities. In this preferred
embodiment, the joint probability estimates incorporate the hypothesis of independence
between the individual attributes. Therefore in the definition formula given above we
substitute $\prod_{j=1}^{k} p(a_{i_j})$ for $p(a_{i_1}, \ldots, a_{i_k})$ and $\prod_{j=1}^{k} (1 - p(a_{i_j}))$ for $p(\bar{a}_{i_1}, \ldots, \bar{a}_{i_k})$.
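The match probability under the independence hypothesis can be sketched in a few lines of Python. This is an illustrative sketch, not the code of the appendices: the function name `f_match` mirrors the text, while the argument `marginals` (a list of the first-order probabilities of the k attributes) is an assumed calling convention.

```python
from math import comb, prod

def f_match(marginals, h, r):
    """Probability, under independence, that in one r-sample exactly h rows
    contain all k attributes and the remaining r - h rows contain none of them.

    marginals: assumed list of first-order probabilities p(a_i1), ..., p(a_ik).
    """
    p_all = prod(marginals)                    # product of p(a_ij)
    p_none = prod(1.0 - p for p in marginals)  # product of (1 - p(a_ij))
    return comb(r, h) * p_all ** h * p_none ** (r - h)

# Two attributes, each present in half of the objects, r = 2 sampled rows.
print(f_match([0.5, 0.5], 1, 2))
```

With two attributes of marginal probability 0.5, a match of size 1 in a 2-sample has probability 2 x 0.25 x 0.25 = 0.125.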

Hypothesis Test Function and Correlation Measure. An hypothesis test is a
mathematical procedure, implemented as a computer program or subroutine, or in
special-purpose electronic and/or optical hardware, which takes a pair of numbers
representing the expected and observed numbers of coincidences, respectively, for a
particular set of k attributes, and produces a number C representing an estimate of the
correlation among the k attributes.

In some preferred embodiments, a Chernoff bound on tail probabilities provides the 
hypothesis test function, as described below. 

Let random variable $X_i$ hold the value $h_i$ for each iteration $i$, and let
$X = \sum_{i=1}^{T} X_i$, and note that $0 \le X \le T \cdot r$. The method of
Chernoff-Hoeffding bounds [8] provides the following theorem:

Let $X = X_1 + X_2 + \cdots + X_n$ be the sum of $n$ independent random variables, where
$l_i \le X_i \le u_i$ for reals $l_i$ ("lower") and $u_i$ ("upper").

Then

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{n} (u_i - l_i)^2}\right). \qquad (1)$$

For our purposes, we set $n = T$ and $l_i = 0$ and $u_i = r_i$ for all $i = 1, 2, \ldots, T$, and we
thereby obtain

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{T} r_i^2}\right). \qquad (2)$$

Using this mathematical relationship, an effective procedure for computing a
correlation value can be defined:

$$\mathrm{Corr}(a) = 1 - \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{T} r_i^2}\right).$$

In the special case wherein the same sample size r is used for every iteration of the
sampling, that is, when $r_i = r$ for all $i = 1, 2, \ldots, T$, then the above formulas reduce to the
simpler forms:

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{T r^2}\right) \qquad (2a)$$

$$\mathrm{Corr}(a) = 1 - \exp\left(\frac{-2\delta^2}{T r^2}\right).$$



Here the correlation value corresponds to an estimate of 1 minus the probability of
having observed that many coincidences, over T iterations of r-sampling, if the hypotheses
underlying the expected count were true. If the assumption of independence between
the attributes was used to compute the expected count, as described above for some preferred
embodiments, then this hypothesis test provides a correlation value for each cset that
estimates the deviation from independence; that is, it estimates the statistical dependence
between the attributes making up the cset.
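The correlation measure, 1 minus the Chernoff-Hoeffding tail bound with the sum of squared sample sizes in the denominator, can be sketched as follows. This is a minimal sketch under stated assumptions: `delta` is taken to be the observed count minus the expected count, and returning 0 when the observed count does not exceed the expected count is a choice made here because the bound is one-sided; neither detail is confirmed by the text.

```python
from math import exp

def corr(delta, sample_sizes):
    """1 - exp(-2 * delta^2 / sum(r_i^2)) for a positive excess delta.

    delta: assumed to be (observed count - expected count).
    sample_sizes: the sample size r_i used in each of the T iterations.
    """
    if delta <= 0:
        return 0.0  # assumption: no excess over expectation, no evidence of correlation
    denominator = sum(r * r for r in sample_sizes)
    return 1.0 - exp(-2.0 * delta ** 2 / denominator)

# Fixed sample size r = 3 over T = 2 iterations, excess of 3 coincidences.
print(corr(3.0, [3, 3]))
```

When every iteration uses the same r, the denominator is T times r squared, which is the simpler special-case form.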

Operation of the Components Within a Process 

Typically, the representation component is performed first within the overall process 
of the current invention. A plurality of sampling iterations is performed on the 
representation of the data, and for each r-sample, the detection and recording of 
coincidences is performed. The sampling iterations may be performed sequentially or in 
parallel, or in some combination of sequential and parallel steps. 

At any stage within the process, the determining of an expected count of 
coincidences, for some or all of the coincident sets of attributes, is performed. This 
component of the process may be performed all at once for all coincident sets, or 
incrementally; sequentially or in parallel, or in some combination. It may be performed for 
coincident sets (csets) as each coincidence is detected or stored, or may be performed before 
or after such detection or recording. 

After some number of sampling iterations has been performed, the comparing of 
actual to expected number of coincidences may be performed for some or all recorded 
coincident sets. This may be done for all csets at once, or for any subsets of them at 
different points throughout the process. These comparisons for different csets may be 
performed sequentially or in parallel, or in some combination thereof.

After some number of sampling iterations has been performed, the reporting of sets 
of correlated attributes may be performed for some or all of the recorded coincident sets that 
have been determined, in the comparisons, to signal significant correlations between the 
component attributes. This may be done for all csets at once, or for any subsets of them at 
different points throughout the process. These comparisons for different csets may be 
performed sequentially or in parallel, or in some combination thereof. 

Program Method Description of a Preferred Embodiment 
Below is shown, in pseudocode, a program on appropriate media, for example, a
floppy disk, hard drive, RAM or other such media, corresponding to one possible 
embodiment on a programmable digital computer. 

Figure 5 provides a pictorial example of the application of this embodiment to a
fictional toy dataset. Three iterations of r-sampling (for r = 3) on the toy dataset are
depicted, top to bottom. For each iteration, the left-hand box represents the dataset, with
outlined entries representing the sampled rows. The right-hand box represents the set of
bins into which the attributes collide. For example, in the first iteration, A@1, B@2, and
D@4 all occur in the first and second of the three sampled rows, so they each have incidence
vector 110 and collide in the bin labelled by that binary address. Bins containing only a
single attribute are ignored; and "empty" bins are never created at all. All bins are cleared
and removed after each iteration, but collisions are recorded in the Csets global data
structure.
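The binning of attributes by incidence vector within one r-sample can be sketched in Python as follows. The rows below are hypothetical stand-ins arranged to reproduce the A@1, B@2, D@4 collision described for Figure 5 (the figure's actual data is not reproduced here), and the function name `find_coincidences` is an assumption.

```python
from collections import defaultdict

def find_coincidences(sampled_rows):
    """Bin attribute labels by their incidence vector over the sampled rows.

    sampled_rows: list of sets, each holding the attribute labels present in one row.
    Returns only bins in which two or more attributes collide.
    """
    bins = defaultdict(list)
    # Only labels present in at least one sampled row are considered,
    # so no bin is ever created for an all-zero incidence vector.
    for label in sorted(set().union(*sampled_rows)):
        vector = tuple(int(label in row) for row in sampled_rows)
        bins[vector].append(label)
    return {vec: labels for vec, labels in bins.items() if len(labels) > 1}

# Hypothetical 3-sample: A@1, B@2 and D@4 occur in the first two rows only.
rows = [
    {"A@1", "B@2", "D@4", "C@3"},
    {"A@1", "B@2", "D@4"},
    {"E@3"},
]
print(find_coincidences(rows))
```

In this toy sample the three attributes share incidence vector (1, 1, 0), while C@3 and E@3 each sit alone in a bin and are ignored.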

Procedure to find correlated sets of attributes:

0.  begin
1.    read(MATRIX);
2.    read(R, T);
3.    compute_first_order_marginals(MATRIX);
4.    csets := {};
5.    for iter = 1 to T do
6.      sampled_rows := rsample(R, MATRIX);
7.      attributes := get_attributes(sampled_rows);
8.      all_coincidences := find_all_coincidences(attributes);
9.      for coincidence in all_coincidences do
10.       if cset_already_exists(coincidence, csets)
11.         then update_cset(coincidence, csets);
12.         else add_new_cset(coincidence, csets);
13.       endif
14.     endfor
15.   endfor
16.   for cset in csets do
17.     expected := compute_expected_match_count(cset);
18.     observed := get_observed_match_count(cset);
19.     stats := update_stats(cset, hypoth_test(expected, observed));
20.   endfor
21.   print_final_stats(csets, stats);
22. end

Steps 5 through 21 of the pseudo-code represent the steps of the base method described
herein, namely:

• sampling a subset of the matrix for a predetermined number of iterations, each subset
of attributes being the same,

• detecting and recording counts of coincidences of attributes in each sampled subset,
a coincidence being the occurrence of a plurality of attributes in an object in a
sampled subset, where the plurality of attributes is the same for each occurrence,

• determining an expected count for each coincidence of interest, the determining
being performed before, at the same time, or after sampling, detecting and recording,

• comparing, for each coincidence of interest, the observed count of coincidences
versus the expected count of coincidences, and from this comparison determining a
measure of correlation for the plurality of attributes for the coincidence, and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated
attributes is a plurality of attributes for which the measure of correlation is above a
pre-determined threshold.
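The sampling and recording loop (steps 4 through 15 of the pseudocode) can be sketched as a short runnable Python program. This is an illustrative sketch rather than the Perl code of the appendices. Several details are assumptions: rows are drawn without replacement, columns that are all zero in a sample are skipped, and each cset accumulates the sum of its match sizes h, following the definition of X as the sum of the per-iteration values. The expected-count and hypothesis-test steps (16 through 21) are left out.

```python
import random
from collections import defaultdict

def r_sample(matrix, r, rng):
    """Step 6: draw r rows (here without replacement, one of several schemes)."""
    return rng.sample(matrix, r)

def find_all_coincidences(sampled_rows):
    """Steps 7-8: bin columns by incidence vector; keep bins with 2+ columns.

    Returns (cset, h) pairs, where h is the number of sampled rows in the match.
    """
    bins = defaultdict(list)
    for j in range(len(sampled_rows[0])):
        vector = tuple(row[j] for row in sampled_rows)
        if any(vector):  # assumption: ignore columns absent from every sampled row
            bins[vector].append(j)
    return [(frozenset(cols), sum(vec)) for vec, cols in bins.items() if len(cols) > 1]

def collect_csets(matrix, r, T, seed=0):
    """Steps 4-15: accumulate, per cset, the total match size over T iterations."""
    rng = random.Random(seed)
    csets = defaultdict(int)  # a missing key starts at 0, covering steps 10-12
    for _ in range(T):
        for cset, h in find_all_coincidences(r_sample(matrix, r, rng)):
            csets[cset] += h
    return dict(csets)

# Hypothetical toy matrix: columns 0 and 1 always agree, column 2 is their complement.
toy = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
print(collect_csets(toy, r=3, T=5))
```

Keying the dictionary by a frozenset of column indices makes the "already exists" test of step 10 a single hash lookup.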

Appendix "B" contains actual source code written in the Perl language for running on a
Sun4 computer in the Sun UNIX operating system. Sample input data for the code listing in
Appendix "B" is listed in Appendix "C" for partial amino acid sequences from V3 loop of
HIV envelope proteins. The corresponding output from the code of Appendix "B" for the
input of Appendix "C" is shown in Appendix "D". In order to produce the output of
Appendix "D", the adjunct Perl language program listed in Appendix "E" was used for
clarification and presentation from the main code listing in Appendix "B". A general flow
diagram for this embodiment is shown in Figure 6, while a general block diagram is shown in
Figure 7. The resulting report was stored in a flat file as a relatively unstructured ASCII
database, which was later printed; it could equally well have been sent to a printer directly or
sent across a network for report to other resources.

Alternative Embodiments 

Descriptions of alternative embodiments of the present invention may be divided into
two categories, described separately below: first, different physical embodiments of the
system/process as may be used in many potential problem-specific applications; and, second,
different interpretations of the components enumerated in the description above, according
to different problem-specific applications of the present invention.

Different Implementations 

For example, among the many possible embodiments as programs on programmable 
digital computers: 

The method may be run entirely sequentially, as in the most straightforward
interpretation of the pseudocode given above, or the method may be run on parallel (vector
or multiprocessor) or distributed computer systems in many possible ways. A set of
computations may be run in parallel, in which each computation performs the entire program
steps outlined above, but with each separate computation using a different value for r, the
sample size; or each separate computation could run the same program steps with the same key
parameter values, but start with different initial random number seeds for the random
r-sampling. Alternatively, the entire program steps outlined above could be run once, but each
different r-sample could be forked off into a separate process run on different processors,
wherein each such process would comprise the detection and optionally recording steps,
with the global cset counts later joined into the global process and global data structures.
Additionally, the computation of the expected counts, and the comparisons of expected with
observed counts, could be performed all at once or incrementally, sequentially or in parallel.
Similarly, the reporting of the estimated correlation values can be performed for some or all
of the Csets, once at the end of computation or incrementally throughout, in serial or
parallel.
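The joining of per-process cset counts into the global data structure can be sketched as a simple merge. This Python sketch shows only the join step; how the partial counts are produced (separate processes, separate machines, or different random seeds) is left open, and the name `merge_cset_counts` and the example counts are hypothetical.

```python
from collections import Counter

def merge_cset_counts(partial_counts):
    """Join per-worker cset counts into one global table by summing counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total

# Hypothetical partial results from two independently run workers.
worker_a = {frozenset({0, 1}): 3}
worker_b = {frozenset({0, 1}): 2, frozenset({1, 2}): 1}
print(merge_cset_counts([worker_a, worker_b]))
```

Because the merge is a plain sum, the order in which workers finish does not affect the global counts.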

The output of the method, which can include the reporting of the significantly 
correlated k-tuples of attributes (the csets that are deemed sufficiently highly correlated in 
the comparing, a.k.a., hypothesis testing stage), can be verbal, and/or numerical and/or 
graphical. 

A number of sampling schemes are possible, including deterministic, pseudo-random, 
or purely random. And if pseudo-random or random, any of a number of random sampling 
schemes may be used, including hypergeometric and multinomial sampling. The r objects 
within an r-sample may be sampled "with replacement" or "without replacement". At the 
next level up, the set of r samples themselves may be drawn "with replacement" or "without 
replacement". 
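The difference between drawing the r objects with and without replacement can be sketched with Python's standard pseudo-random generator. The function name `r_sample` and its `replace` flag are assumptions made for illustration; the text does not prescribe a particular generator or interface.

```python
import random

def r_sample(rows, r, rng, replace=False):
    """Draw r rows pseudo-randomly, with or without replacement."""
    if replace:
        return rng.choices(rows, k=r)  # the same row may be drawn more than once
    return rng.sample(rows, r)         # all drawn rows are distinct; needs r <= len(rows)

rng = random.Random(1)  # fixed seed so that a run can be repeated
rows = list(range(10))  # stand-ins for row indices of the data matrix
print(r_sample(rows, 5, rng))
print(r_sample(rows, 5, rng, replace=True))
```

Seeding the generator explicitly also gives a simple way to start parallel runs from different initial random number seeds.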

Different choices for the key sampling parameter r are possible, and it is not 
necessary to use the same number r for each sample. 

Many possible choices exist for T, the number of sampling iterations. It is possible to 
use any of a number of mathematical methods for choosing T in order to achieve a desired 
confidence level in the degrees of correlation estimated for the k-tuples of attributes 
discovered by the method of the current invention. Alternatively, it is possible to run the 
procedure for a given fixed number of iterations and then print or view the results, or to 
interleave the running of some number of iterations with the printing or viewing of partial 
results. 
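One possible way of choosing T, offered here only as an assumption-laden sketch, follows from the fixed-sample-size bound $\exp(-2\delta^2 / (T r^2))$. If the excess is written as $\delta = T\varepsilon$, where $\varepsilon$ is an assumed average excess per iteration, the bound becomes $\exp(-2T\varepsilon^2 / r^2)$, and requiring it to be at most $\alpha$ gives $T \ge r^2 \ln(1/\alpha) / (2\varepsilon^2)$. The symbols $\varepsilon$ and $\alpha$ and the function name below are introduced for illustration and do not appear in the original description.

```python
from math import ceil, log

def iterations_needed(r, epsilon, alpha):
    """Smallest T with exp(-2 * T * epsilon^2 / r^2) <= alpha.

    r: fixed sample size per iteration.
    epsilon: assumed per-iteration excess of observed over expected count.
    alpha: target upper bound on the tail probability.
    """
    return ceil(r * r * log(1.0 / alpha) / (2.0 * epsilon * epsilon))

# Sample size 3, an average excess of half a coincidence per iteration, 5% tail bound.
print(iterations_needed(3, 0.5, 0.05))
```

For these values the sketch gives T = 54; smaller excesses or larger sample sizes call for more iterations.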

Many possible ways exist for the representation, storage, and accessing of the Csets
data structure used during the processing of the algorithm. The Csets data may be stored
and accessed via a hash table, a k-d tree, a Patricia tree (a compact form of trie), and/or in
other ways, known to those skilled in the art, of storing and accessing data efficiently. Whatever
data structure is chosen, the structure may be stored physically in registers, in main memory,
and/or on secondary or external storage media such as magnetic disks, magnetic tape, or
optical storage media.
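
As one illustration of the hash-table option, the sketch below keeps observed coincidence counts in a Python dictionary keyed by the cset. The class name `CsetTable` and its methods are hypothetical; the text does not prescribe any particular interface.

```python
from collections import Counter

class CsetTable:
    """Observed coincidence counts keyed by cset (an unordered set of attributes)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, attributes, hits=1):
        # frozenset is hashable and order-independent, so the same cset
        # always maps to the same table entry regardless of attribute order.
        self._counts[frozenset(attributes)] += hits

    def observed(self, attributes):
        # A cset that was never recorded has an observed count of 0.
        return self._counts[frozenset(attributes)]

    def items(self):
        return self._counts.items()
```

For example, recording the cset {("G", 3), ("L", 5)} twice, with the attributes listed in a different order each time, updates a single entry.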

As an alternative to the embodiments of the method on general-purpose computing
hardware of various types, there are many possible embodiments on special-purpose
electronic, optical, or electro-optical hardware, or some combination of general-purpose and
special-purpose architectures and devices.

For example, very efficient special-purpose electronic hardware (LSI or VLSI) may be used to
implement the matrix representation of the current invention. This is made possible by the fact
that the incidence vectors of attributes are simple binary vectors, by the fact that the
coincidence "bins", described earlier in one view of the current invention, correspond to
"addresses" in a memory space of size 2^r for each r-sample, and by the ability with current
technology to design, fabricate and use special-purpose hardware for random-number
generation and sampling, for fast-access storage of the Csets data structures, and for the
mathematical functions used in the calculation of expected count estimates, hypothesis
tests and correlation estimates.
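
The correspondence between incidence vectors and memory addresses can be sketched in software as follows. This is only an illustration: the function names are hypothetical, rows are assumed to be equal-length character strings, and attributes are written as (value, position) pairs.

```python
def incidence_address(column_values, value):
    """Pack the incidence vector of the attribute (column == value) into an integer.

    Bit i is 1 when row i of the r-sample has `value` in this column, so the
    result is an address in the range [0, 2**r).
    """
    address = 0
    for i, v in enumerate(column_values):
        if v == value:
            address |= 1 << i
    return address

def hit_count(address):
    """Number of rows of the r-sample in which the attribute occurs."""
    return bin(address).count("1")

def bin_attributes(sample):
    """Group the (value, position) attributes of an r-sample by incidence address."""
    bins = {}
    for pos in range(len(sample[0])):
        column = [row[pos] for row in sample]
        for value in set(column):
            bins.setdefault(incidence_address(column, value), []).append((value, pos))
    return bins
```

Attributes that land in the same bin occur in exactly the same rows of the r-sample, which is the sense in which a bin holds a coincidence.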

Special Purpose Hardware Method Description of a Preferred Embodiment 
1. Overview 

Referring now to Figure 14, an embodiment of the special-purpose hardware mentioned
previously is intended to exploit the potential benefits of parallelizing the execution of the
algorithm. A node (defined below) divides a given data set along M (the number of rows of
data) and distributes these portions to its CPs (also defined below). The CPs may be either
other nodes (in a recursive definition) or special-purpose processors developed to
perform step 8 in the method as described in high-level "pseudo-code" in the previous
Program Method Description of a Preferred Embodiment Section. When the results have
been computed by the node's CPs, the merging step (steps 9 through 14 in the above-noted
"pseudo-code" description) is performed by the node. Once the merging has been done, the
results are passed back to the node's parent. If the node is the root of the tree, the complete
results set is sent back to the driver that controls this hardware. The system described below
can be used "off-line" from a main computer's CPU; among other possibilities for
commercial marketing and use of such a system is its implementation on a special "board" or
"card" that a user can purchase and install on his or her personal computer or workstation.
One can also envision the use of one or a number of such special subsystems on a local area
network or a "supercomputer" installation. The described embodiment represents only one
of many possible ways, as will be understood by those skilled in the art, to parallelize the
methods described herein.
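
The recursive divide-and-merge structure of a node can be sketched as follows. This is a software analogy only: `leaf_cp` is a hypothetical stand-in that merely counts identical work items (it does not perform the coincidence detection of step 8), and merging is assumed here to mean adding the counts of identical keys.

```python
from collections import Counter

def leaf_cp(items):
    """Hypothetical stand-in for a special-purpose CP: count identical items."""
    return Counter(items)

def run_node(items, fanout, depth):
    """Process `items` on a node with `fanout` CPs.

    depth == 0 means the CPs are leaf processors; otherwise each CP is itself
    a node with the same fanout (the recursive definition).
    """
    portions = [items[i::fanout] for i in range(fanout)]  # divide the work
    merged = Counter()
    for portion in portions:
        if depth == 0:
            partial = leaf_cp(portion)
        else:
            partial = run_node(portion, fanout, depth - 1)
        merged.update(partial)  # merging step: add counts of identical keys
    return merged  # passed back to the parent (or the driver, at the root)
```

Because the partial counts are simply added, the merged result does not depend on how the work was divided among the CPs.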

The implementation described below is assumed to act solely on character-valued
data attributes. This is in no way a limitation of the basic methods described herein; rather, it
is a specific implementation of the basic methods. The implementation could easily follow a
binary-attribute encoding as described elsewhere herein.

A diagram of a node is shown in Figure 14 with compute processors (CPs). The
node includes the following:

A bank of memory where input to be sent to the CPs is stored (the input buffer) and
where results found by the CPs will be stored (the output buffer).

A memory bus divided into control, data and address buses, used to arbitrate
communication on the bus itself as well as being the vehicle for data transfer.

A set of bit flags and a small additional portion of memory (LastOut). LastOut is the
address of the section in the output buffer that was last written to. The two bit flags
are used by the merge and I/O processors to determine what state they each are in.

An array of size J of compute processors (CPs), each with its own local memory
cache, which perform the discovery of coincidences.

A merge processor (MG) which has its own cache of memory in which it writes the
merged results of the CPs.

An input/output processor (IO) whose main responsibility is to control use of the
bus.

A clock which is used to ensure that each element in the system runs synchronously
with respect to every other element. Execution of each of the parts in the system can
be thought of as running in lock-step.

Compute processors are defined as being either special processors that perform the
R-sampling step of the algorithm (step 8 in the pseudo-code description, shown graphically in
Figure 5) or nodes themselves. This allows the possibility of a tree structure of such nodes
rather than limiting embodiments solely to a vector arrangement. For any particular choice of
hardware for the memory bus, there may be a maximum useful number of CPs
per node. A tree structure allows a way around this limit.

The implementation assumes that maximal values of method parameters R and N
(Rmax and Nmax) are specified a priori. It is the responsibility of the software driver to
detect when these limits have been violated and react accordingly.

2. Bank of Memory 

Each node has memory of size 2*J*Amax*Rmax*Nmax, where Amax is the maximal
total number of iterations that can be done in the node. This memory is divided equally into
the input and output buffers. Note that the size of the input for a single iteration is no greater
than J*Rmax*Nmax, and neither the locally-produced results nor the final merged results
(formed by combining the partial results from the J CPs) can exceed this limit, so there is no
risk of exceeding available memory.
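
The sizing argument can be checked with a short calculation. The sketch below assumes that the total node memory is 2*J*Amax*Rmax*Nmax (the reading that is consistent with the per-iteration bound of J*Rmax*Nmax) and that sizes are counted in data-bus words; the function name and the example values are hypothetical.

```python
def node_memory(J, A_max, R_max, N_max):
    """Return the buffer sizes implied by the sizing rule (assumed reading)."""
    total = 2 * J * A_max * R_max * N_max
    buffer_size = total // 2  # memory is divided equally into input and output
    return {
        "total": total,
        "input_buffer": buffer_size,
        "output_buffer": buffer_size,
        "max_input_per_iteration": J * R_max * N_max,
    }

# Example: 4 CPs, at most 10 iterations, samples of at most 8 rows by 100 columns.
sizes = node_memory(J=4, A_max=10, R_max=8, N_max=100)
```

With these example values, Amax iterations of at most J*Rmax*Nmax input each fill the input buffer exactly, which is why the buffer cannot overflow as long as the driver respects the limits.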

Access to this memory is as follows:

IO has write access to the input buffer and read access to the output buffer.
MG has no access to the input buffer and read access to the output buffer.
CP has read access to the input buffer and write access to the output buffer.

3. Memory Bus

Control of the memory bus is the responsibility of the IO processor. Each device is
assigned a numeric identifier from 0 to J + 1: IO is implicitly assigned 0, MG is assigned 1,
and the CPs are assigned 2 through J + 1. The memory bus is divided into three sections:

Control: Two wires for each CP, two for MG and two for IO comprise the control
bus. The first of each pair is called the request wire while the second is known as the
response wire.

Address: Each device in the system is assigned a unique memory address range. The
address bus, used in combination with the data bus, determines what device the
current value on the data bus will be written to and, if applicable, where within that
device it will be stored. The width of the address bus (i.e. the number of wires in it)
is determined by the size chosen for the memory that stores input and output and
thus will not be specified here.

Data: Given the assumption that only character-valued data attributes will be handled
by this system, the data bus is eight wires wide.

Bus arbitration is handled through the use of the control bus. When a device (here 
meaning MG, IO or one of the CPs) wishes to use the bus, it asserts a logical 1 on its 
request wire. On any given cycle, more than one device may have done so. IO, when it 
returns to its bus arbitration duties, simply sets the lowest numbered device's response wire 
to 1 and zeroes all the other response wires. This tells the lowest identified device that it has 
permission to use the bus (reads and writes are not indicated - IO is responsible for 
establishing this context) and all others that they must wait. All devices that wish to use the 
bus continue to assert 1 on their request wire until given permission. When the permitted 
device has finished with the bus, the device asserts 0 on its request wire, indicating to IO 
that it may reassign the bus to another device. "Handshake" and other types of protocols, 
such as described above, are well-known to and understood by those skilled in the art. 
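
The arbitration rule just described (grant the bus to the lowest-numbered requesting device and zero every other response wire) can be sketched in software as follows. The function name `arbitrate` and the list representation of the wires are illustrative assumptions.

```python
def arbitrate(request_wires):
    """Return the response wires for one arbitration cycle.

    request_wires[i] is 1 when device i is asserting its request wire.
    Exactly one response wire is set (the lowest-numbered requester's),
    or none when no device is requesting the bus.
    """
    response = [0] * len(request_wires)
    for device_id, requesting in enumerate(request_wires):
        if requesting:
            response[device_id] = 1
            break  # all higher-numbered devices must keep waiting
    return response
```

For example, if devices 1 and 3 both request the bus, device 1 is granted it; once device 1 asserts 0 on its request wire, the next arbitration cycle grants the bus to device 3.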

4. Bit Flags and Additional Memory 

The additional memory is used by IO to store the last written output section. There is
no need to store a list of such sections for MG because writes to the output buffer are
done incrementally and MG can determine how many unused sections it has waiting by
comparing its last read index with the last written index. Only IO can write to this memory
and only MG may read from it.

Two bit flags are used to indicate "IO finished" (meaning IO has sent all data out
and received all CP output) and "Merge finished".

5. An Array of J Compute Processors

As noted above, these are either nodes or special-purpose processors that
compute one R-sampling step in the algorithmic description of the general method of the
current invention. In the latter case, they may comprise:

a processor which performs the coincidence detection in addition to the functions
listed below

local memory of size 2*Nmax*Rmax

The memory is split into two equal portions for input and output.

Initially, a CP asserts 1 on its request wire, indicating that it is ready for data. When 
it sees only its response wire set to one on the following cycle, it expects to be sent the 
current values for R and N and then the data itself (otherwise, it waits for this to be the 
case). Based on the first two values, it can determine when the current input is exhausted. It 
then asserts 0 on its request wire and performs the binning and coincidence detection steps 
of the method. When these steps have been completed the CP asserts logical 1 again on its 
request wire, this time indicating its desire to send its results. When given permission to use 
the bus, it sends its coincidence set to IO. IO is responsible for managing the location for 
storage of this data. The output stream of the CP comprises a tally of the coincidences found 
followed by the coincidences (csets) themselves. The coincidences are of the form: 

hit count (no higher than Rmax)

size (that is, the width of the cset, i.e., the number of component attributes)

a size-long list of the attributes of the coincidence in form (value, position)

When all data has been sent to IO, the CP asserts 1 on its wire to request more data.
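
The layout of the CP output stream (a tally, then for each coincidence its hit count, its size, and a size-long list of (value, position) attributes) can be illustrated with the following sketch. The flat Python list stands in for the sequence of values placed on the data bus; the function names are hypothetical.

```python
def encode_cp_output(coincidences):
    """Flatten [(hit_count, [(value, position), ...]), ...] into one stream."""
    stream = [len(coincidences)]  # tally of coincidences found
    for hits, attributes in coincidences:
        stream.append(hits)             # hit count
        stream.append(len(attributes))  # size (width of the cset)
        for value, position in attributes:
            stream.extend((value, position))
    return stream

def decode_cp_output(stream):
    """Inverse of encode_cp_output, as a receiver would read the stream."""
    it = iter(stream)
    tally = next(it)
    coincidences = []
    for _ in range(tally):
        hits = next(it)
        size = next(it)
        attributes = [(next(it), next(it)) for _ in range(size)]
        coincidences.append((hits, attributes))
    return coincidences
```

Because each record carries its own size, the receiver can tell where one coincidence ends and the next begins without any separator values.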

6. Merge Processor MG 

The merge processor may comprise:

a processor that runs the merging step

local memory of size Nmax*Rmax used to store the output from one CP

counters C1 and C2 (the former tracks the last output section read by MG;
the latter counts the number of coincidences currently stored in the merge
buffer)

memory used to store the current value of A

memory of size J*Nmax*Rmax*Amax used to store the merged results

Initially, MG sets its counters to zero and its request wire to zero and waits for IO to signal
it (by setting this wire to 1) that there is output data to be processed.

When MG sees that its request wire has been turned on, it knows to start receiving
output data indexed by the counter into its local memory. Once this has been accomplished,
MG can start the merging algorithm. The merge is done from the local memory directly into
the merge buffer (C2 must have the current number of coincidences when this step is
finished). When this step is completed, MG retrieves the current value of LastOut. If it is
greater than C1, then MG knows it can increment C1 and move directly on to the next
output section. If C1 and LastOut are equal, then MG sets its request wire to zero. If C1 has
reached A*J, then MG knows that all the results have been computed and merged (and thus,
that all CPs and IO are idle) and that it should set its bit flag to one (indicating that it is
finished) and start sending the contents of the merge buffer back to IO for transmission to
this node's parent. The results are sent simply as the value of C2 followed by the list of
coincidences stored in the merge buffer (the form of the coincidences is identical to that
described in section 5 above).

7. Input/Output Processor IO

IO contains: 

a bit vector of size J 

a counter, C1, indicating the next available output bin

a counter, C2, indicating the next unused R*N portion of input

IO is intended to govern the execution of the algorithm as a whole, as it is responsible for the
bus arbitration scheme outlined earlier. Initially, IO sets C1 and C2 to zero and zeroes its bit
vector (indicating that it has sent no data to any CP) and waits for the software driver to
start sending it data. During this time, it knows that no work can be done, and thus zeroes
all permissions for the bus. An interrupt signals the arrival of data from the driver and IO
continues to zero all communication requests until all the data has been written to the input
buffer. The incoming data is of the form:

N

R

T, the total number of row sets of size R sent

a data stream of size T*R*N

IO can thus determine when no more data can be expected. Note that it is the responsibility
of the driver to:

divide data mining requests into sizes no greater than Amax

ensure that the number of rows sent as input is evenly divisible by R

ensure that Rmax and Nmax have not been exceeded by the current data set

merge all results sent back from the device

Once all input has been stored, IO sends out data of size R*N to each CP by first setting the
ith bit in the vector to one (this indicates that IO should expect output from CPi), signaling
that CP by setting its response wire to 1 while zeroing all others, sending the data onto the
bus and finally incrementing C2.

When all CPs are busy (or all available input has been exhausted), IO waits for a CP
to assert 1 on its request wire, which indicates that it is ready to send back results. Once this
signal has been received from a CP, IO retrieves the results from the CP, stores them in the
output section indexed by the counter, zeroes the bit associated with that CP, increments C1
and asserts 1 on the MG request wire. If there is unused data in the input buffer, IO sends the
next available R*N set to the CP that just returned results (setting the bit for that CP to
one). When C2 equals T and the bit vector contains no bits set to 1, then IO knows that it is
finished and sets the IO bit flag to 1. At this point, IO goes back to the previously described
wait state until it sees the MG bit flag also set to 1 (indicating that MG has finished its
work). Once this occurs, IO calls an interrupt (if this node is the root of the tree) or just
requests to send (if this node has another node for a parent), gives MG permission to write
on the bus and then passes all data sent from MG to the parent.

Note that the proposed scheme allows for unequal execution time among the CPs -
the next CP to get data is the one most recently finished with its last allowance of data.
Thus, even though the overall operation of the system is clocked, there is a degree of
asynchronous processing ability.
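
The effect of this dispatching rule can be illustrated with a small simulation. The sketch below is an assumption-laden software model: each R*N row set is given a hypothetical processing time, CPs are numbered from 0, and ties are broken in favour of the lower-numbered CP; in the hardware, completion is signalled on the request wire rather than computed from known times.

```python
import heapq

def dispatch(work_times, J):
    """Assign row sets to J CPs; a CP receives the next set as soon as it finishes.

    work_times[k] is the (hypothetical) processing time of row set k.
    Returns a list of (row_set_index, cp_index) in dispatch order.
    """
    busy = []        # min-heap of (finish_time, cp_index)
    assignment = []
    next_set = 0
    # Initial round: each CP in turn receives one row set.
    for cp in range(min(J, len(work_times))):
        heapq.heappush(busy, (work_times[next_set], cp))
        assignment.append((next_set, cp))
        next_set += 1
    # Afterwards, the CP that finishes first receives the next unused row set.
    while next_set < len(work_times):
        finish, cp = heapq.heappop(busy)
        heapq.heappush(busy, (finish + work_times[next_set], cp))
        assignment.append((next_set, cp))
        next_set += 1
    return assignment
```

With processing times [5, 1, 1, 1] and two CPs, the second CP handles the last three row sets while the first is still busy with the slow one, so no CP sits idle waiting for another.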

The choices for particular processors, buses and other components are open to the
discretion of designers, fabricators, manufacturers, sellers, buyers and users, and the ranges
of options are known to those skilled in the art. In particular, all parts of the embodiment
described above may be obtained from "off-the-shelf" sources, or may be specially designed
at the VLSI level by persons skilled in the art.



Different Applications 
General 



Special-purpose embodiments are also possible. For example, in an application to
marketing and analysis of sales/transactions data, the objects input to the methods of the
present invention can correspond to transactions, and the attributes correspond to instances
of sale of particular products or services.

In an application to process management, industrial engineering or computer
systems management, the objects can correspond to particular time slices or time periods,
and the attributes correspond to the on/off or used/unused status of particular components,
resources, or subsystems. The goal of the application could be to find k-ary conflicts or
conflicting demands among interacting subsystems or users, in order to improve the
efficiency or lower the costs of the operations.

For example, the methods can be adapted to control a process for production of a
product as shown in the general flow diagram of Figure 8 and the schematic diagram of
Figure 9. This example can represent an automated sheet metal assembly plant. The
methods could be applied to an existing data set in order to discover correlations that indicate
that demand for one of the products from the plant will significantly decrease in the summer
months due to cyclical variations, while demand for another product increases. A link to
automated process control systems in the plant could reduce orders for the first product,
while increasing orders for the other. Many other examples will be evident to those skilled in
the art, including variations to the actual structure of the products as a result of discovered
correlations.

In an alternate embodiment, the discovered correlations may be used to generate
rules for a rules-based system that in turn produces products based upon those rules. A
general flow diagram for such an embodiment is set out in Figure 10. A corresponding
schematic diagram is set out in Figure 11.

In a further alternate embodiment, the rules based system could be used to control a 
process that creates products. A general flow diagram for such an embodiment is set out in 
Figure 12. A corresponding schematic diagram is set out in Figure 13. 

In application to financial analysis or trading, the objects can correspond to particular
time slices or time periods, and the variables can relate to particular prices, or price changes,
of particular financial instruments or commodities. By dividing the prices of each instrument
or commodity into a set of discrete levels, or by using a simple binary code for "increase vs.
decrease", one can represent each such instrument or commodity by a set of attributes, and
the invention can be employed to discover k-tuples of instruments or commodities whose
price movements are correlated. Those in the art know of many ways to gain value from
such discovered information.
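
The simple binary "increase vs. decrease" code can be sketched as follows. The function names and the instrument names in the usage note are hypothetical, and an unchanged price is coded as 0 here, which is an assumption the text leaves open.

```python
def up_down_attributes(prices):
    """One binary attribute per period: 1 if the price rose, else 0."""
    return [1 if later > earlier else 0
            for earlier, later in zip(prices, prices[1:])]

def build_objects(price_table):
    """Turn {instrument: [prices]} into a list of objects, one per time period.

    Each object maps an instrument name to its binary attribute for that period.
    """
    names = sorted(price_table)
    codes = {name: up_down_attributes(price_table[name]) for name in names}
    n_periods = min(len(code) for code in codes.values())
    return [{name: codes[name][t] for name in names} for t in range(n_periods)]
```

For instance, `build_objects({"X": [1, 2, 3], "Y": [3, 2, 4]})` yields two objects, one per price change, which could then be supplied as input objects to the coincidence detection procedure.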

In applications to medicine, epidemiology, or environmental science, the objects can 
correspond to particular patients, or to different timed observations of a single patient, or 
samples from the same or different environmental resource (such as air, soil, or water); the 
variables and derived attributes would correspond to levels, or the presence/absence of 
particular symptoms, drugs, toxins or contaminants. In this way, one can use the present 
invention to discover interactions that may cause disease or environmental hazards. 

In molecular and structural biological applications, the objects might correspond to 
DNA, RNA, or protein sequences and/or structures. The attributes might correspond to the 
presence of particular bases or amino acids at particular sequence positions, or to 
substructures with particular geometric, chemical, physical, or biological properties at 
particular sequence or structural positions, or to the presence or absence or levels of other 
global or local properties. For example, set out further below is a detailed application of the
method to protein structure prediction, examples of which have previously been described.

In pharmacological applications, the objects might correspond to molecular structures
or other labels or representations of particular compounds or drugs, and the attributes might
correspond to the presence, absence, or levels of particular geometric, chemical, physical,
biological, toxicological, therapeutic and/or other properties and features, e.g., particular
chemical moieties. The present method would be used to find correlations among k-tuples
of such properties, and this information can be useful in the design and testing of compounds
and drugs, and in the design of combinatorial libraries for screening and testing, or for other
processes or steps in drug discovery and drug design. Alternatively, the above mapping can
be transposed, so that the objects correspond to the properties and features, and the
attributes correspond to the compounds and drugs. In this way, the present invention can be
used to find sets of drugs with similar or complementary or synergistic or antagonistic
activities. This, too, is extremely useful in drug discovery and drug design.

In applications to demographics, marketing, insurance and credit ratings, and/or
fundraising, the objects can correspond to particular people, or companies, or organizations.
The attributes could correspond to the presence or absence or levels of properties and
features relating to employment, income, wealth, credit history, lifestyle, consumption
patterns, or social/political opinions or affiliations. The present method could be used to
discover associations between such factors, which can be useful in such tasks as predicting
credit/insurance risks or detecting fraud; or in determining the best targets for allocating
limited marketing or fundraising resources, for example.

The problem of finding all significant correlations among pairs or k-tuples of
attributes in a database is ubiquitous in the computational sciences and in medical, industrial,
and financial applications. The principles described herein include a probabilistic algorithm
that has the interesting property of finding significant higher-order k-ary correlations, for all
k such that 2 ≤ k ≤ N in an N-attribute database, for the same computational cost as finding
just significant pairwise correlations. Moreover, k need not be fixed in advance in our
procedure, in contrast with other known procedures. The procedure was designed for the
task of finding conserved structural relationships in aligned protein sequences, but may have
more useful application in other domains.

Application of the Principles Described Herein to Protein Sequence Analysis 

There are interactions between sequence-distant amino acid residues in the protein
chain, sometimes detectable as correlations between positions (columns) in a set of aligned
sequences from a protein structural family, that play an important role in determining
structure and function. Discovered correlations may represent an evolutionary history of
compensatory mutations, and may provide useful features in models of protein
structural/functional families, but are ignored or mishandled by most ML (machine learning)
classification methods, in part because of the high computational complexity of searching for
k-tuples of correlated positions.

In order to practice the invention on a matrix of biological sequences such as
nucleotide or amino acid sequences, the different sequences are first optimally aligned for the
purpose of comparison. A position in a first sequence is compared with a corresponding
position in a second sequence. When the compared positions are occupied by the same 
nucleotide or amino acid, as the case may be, the two sequences are identical at that 
position. The degree of identity between two sequences is often expressed as a percentage 
representing the ratio of the number of matching (identical) positions in the two sequences 
to the total number of positions compared. Optimally aligning two or more sequences 
generally involves maximizing the degree of sequence identity between them. 
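
The degree of identity defined above can be computed as in the following sketch. It assumes the two sequences have already been aligned to equal length and that every aligned position (including any gap symbol) is counted as a compared position; the function name is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of compared positions occupied by the same symbol."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)
```

For example, two aligned four-residue sequences that differ at a single position have a degree of identity of 75%.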

Several algorithms and computer programs are known to those of ordinary skill in
the art for aligning sequences. These tools include the PILEUP program from the Genetics
Computer Group (Madison, WI) package (version 8), which uses a modified version of the
progressive alignment method of Feng and Doolittle [J. Mol. Evol. 25, 351 (1987)];
CLUSTAL X, freeware available from the European Molecular Biology Laboratory
(EMBL), Heidelberg, Germany; and BLAST, freeware available from the National
Institutes of Health (NIH), Bethesda, MD. BLAST-P is used for amino acid sequences;
BLAST-N is used for nucleotide sequences; and BLAST-X is used for nucleic acid
codon/amino acid translation.

Several kinds of useful information can be obtained from protein sequence family 
analysis. 

First, there is information to be extracted at the level of individual sequences, in the 
form of joint symbol frequencies. It is well-known that an abnormally high observed 
frequency of a particular single-position pattern (e.g., "G occurs at residue number 3 in 98% 
of these sequences") can reveal an important physico-chemical constraint on secondary or 
tertiary structure. This is also true of surprisingly-frequent joint symbol occurrences (e.g., 
"G at position 3, L at position 5, and M at position 87 occurs much more often than would 
be predicted by the individual marginal frequencies"). Such long-distance co-occurrences 
might be especially indicative of tertiary constraints, because the designated positions may be 
nearby each other in the 3D structure to which all of the modelled sequences correspond. 
(This detection of "suspicious coincidences", as when p(A, B) » p(A)p(B), is at the heart of
pattern recognition and learning, as noted long ago by others.)
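
The comparison of a joint symbol frequency with the product of the marginal frequencies can be illustrated directly on a small alignment. In this sketch the function name is hypothetical, positions are counted from 0, and simple relative frequencies are used as probability estimates; it is not the sampling-based procedure of the invention.

```python
def joint_and_expected(sequences, pattern):
    """Compare the observed joint frequency of a pattern with its expectation.

    pattern maps a position to the symbol required there, e.g. {0: "G", 1: "L"}.
    Returns (observed joint frequency, product of the marginal frequencies).
    """
    n = len(sequences)
    joint = sum(all(s[p] == c for p, c in pattern.items()) for s in sequences) / n
    expected = 1.0
    for p, c in pattern.items():
        expected *= sum(s[p] == c for s in sequences) / n
    return joint, expected
```

In the alignment ["GL", "GL", "AT", "AT"], the pattern "G at position 0 and L at position 1" occurs in half of the sequences, twice as often as the marginal frequencies alone would predict.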


Second, there is information to be extracted at the "next level up", of statistical
relationships between the positions (columns in an alignment of homologous sequences). If
the existence of frequently occurring joint symbol k-tuples can be used to infer 3D structural
interactions, such an inference is even better supported by certain information-theoretic
relationships between positions (columns) over a set of many different joint symbol
occurrences. This is because such symbolic relationships can signify evolutionarily
conserved physical or structural relationships between different parts of the protein chain.
(See Figure 15). The observation of high values of mutual information and other correlation
measures between columns has been used successfully to predict 3D structural interactions
in RNA and in HIV proteins; for example, see C.E. Shannon and W. Weaver, The
Mathematical Theory of Communication, The University of Illinois Press, 1964. While these
previously reported efforts have focused on pairwise residue-residue interactions, the
principles described herein aim at the detection of k-ary interactions for 2 ≤ k ≤ N.
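
For the pairwise case, the mutual information between two alignment columns can be estimated as in the sketch below. Relative frequencies are used as probability estimates with no small-sample correction, columns are indexed from 0, and the function name is hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(sequences, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(sequences)
    col_i = Counter(s[i] for s in sequences)
    col_j = Counter(s[j] for s in sequences)
    pairs = Counter((s[i], s[j]) for s in sequences)
    return sum(
        (count / n) * log2((count / n) / ((col_i[a] / n) * (col_j[b] / n)))
        for (a, b), count in pairs.items()
    )
```

Two columns that always vary together, as in ["GL", "GL", "AT", "AT"], give 1 bit of mutual information, while two columns that vary independently, as in ["GL", "GT", "AL", "AT"], give 0.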

Discovered k-tuples of correlated amino acid residues can be used in protein
structure prediction and structure determination.

Local predictions can help narrow the search for the best global structure 
predictions. 

First, there are distance geometry constraints. Secondary structure prediction, and
the discovery of k-ary long-distance interactions, give evidence for presumed contacts, of
the form contact(i, j) for the ith and jth amino acid residues in a protein. Using the kind of
distance geometry theory developed by others (see, for example, T.F. Havel, I.D. Kuntz,
G.M. Crippen, The Theory and Practice of Distance Geometry, Bull. of Mathematical Biology
v.45 1983 pp. 665-720, and K.A. Dill, K.M. Fiebig, H.S. Chan, Cooperativity in
Protein-Folding Kinetics, Proc. Natl. Acad. Sci. U.S.A. v.90 March 1993 pp. 1942-1946), one can
derive a set of inferred contacts. One can also derive sets of inferred blocks, contacts that
are forbidden by a given set of presumed or inferred contacts. Essentially, given a model of
a polymer chain constrained to exist within a fixed volume, the assumption that two
particular pieces are brought into contact implies that some other pieces are also brought
into proximity and that still other pieces are moved further apart. Indeed, others have
concluded that "considerable amounts of internal architecture (helices and parallel and
anti-parallel sheets) are predicted to arise in compact polymers due simply to steric restrictions.
This appears to account for why there is so much internal organization in globular proteins."

Second, as discussed throughout the previous sections, one can infer and exploit
empirical relationships between local and global configurations. Local stretches of sequence,
or selected non-local pairs of residues, can be found to occur, with some high probability, in
particular global configurations. Heuristic rules, in whatever form, can be used to avoid
large parts of conformation space. The inference of particular models of cooperativity in
folding is a special case: knowledge of "rules" such as p(contact(i, j) | contact(i + 1, j - 1)) >
p(contact(i, j)) can help significantly.

For example, Figure 16 illustrates steps in tertiary structure prediction. The methods 
described throughout this application can be applied as part of a larger tertiary structure 
prediction system, wherein the principles described above are employed in the block related 
to the analysis of aligned sequence families. The system predicts the structure of a protein. 

Discovery of Evolutionarily-Conserved Structural Constraints

Three questions are addressed in this section: 

1. What kinds of evolutionarily conserved multi-residue structural or functional
constraints might one expect to find by detecting correlations between
columns in a multiple sequence alignment?

2. Have correlation-detection efforts in fact found important structural or
functional constraints?

3. How much information do such discoveries provide towards predicting or
determining a molecule's native tertiary structure?

What Do We Expect to Observe?

A protein family is the set of amino acid sequences that are believed to share a
common global tertiary structure. The theory and observation of protein folding and
evolution support the general idea of evolution and conservation within a protein family:

• Functional constraints are conserved in surface residues;

• Structural constraints are conserved in core residues;

• Mutational drift dominates in loop residues.

Functional constraints often involve other molecules - such as other proteins, nucleic
acids, lipids, metals, O2 or other small molecules.

The kind of structural constraints expected to be conserved throughout evolution of
a protein family are mainly those involving a few key residues that stabilize a conformation.
Where electrostatic interactions are deemed important, one might expect to find a
conservation of net charge across two or more sequence positions. When one of two
electrostatically interacting residues carries a positive charge, its "partner" residue
(presumably close in 3D structure even if distant in sequence) should be negatively charged,
and vice versa. The situation is similar for packing constraints. One might reasonably
expect sections of the protein core volume to vary only slightly across the many different
proteins in the same structural family, while non-core regions might display large volume
variability. Thus one might expect to find pairs or small k-tuples of residues that display
mutually compensatory mutations with respect to side-chain volume - when a "Large"
mutates to a "Small", another "Small" must mutate into a "Large", to put it simplistically.

What Has Been Observed?

Neher et al. (How frequent are correlated changes in families of protein sequences?
PNAS, 91:98-102, 1994) attempted to quantify the frequency of compensatory changes
within a single protein family by using physico-chemical property indices for amino acids and
then estimating Pearson correlations between columns in an alignment. They attempted
to get around the small-dataset problem with a bootstrap-inspired resampling scheme based
on the examination of pairs of sequences from the family. Their study of the myoglobin
family of protein sequences found the degree of compensatory mutation to be low for the
property of side-chain volume but high for electrical charge, close to the correlation level
expected for perfect conservation of local charge. The authors speculate that because their
column-pair analyses focused only on contact-neighbour pairs of residues, they were able to
detect a very locally-acting constraint like charge conservation but not a more distributed
constraint like conservation of volume. (In other words, a single positively-charged residue
must be in contact with its single negatively-charged structural partner, whereas a set of
compatible-volume partners may comprise more than two residues, which need not all be in
contact.) Others have also found some evidence of coordinated mutation in the evolution of
protein structural families.
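
The column-correlation idea can be illustrated with a small sketch. The charge index, the toy alignment and the function names are assumptions made for illustration only; the code computes a plain Pearson correlation over whole columns and does not reproduce the pair-based resampling scheme of the cited study.

```python
import math

# Assumed property index: formal side-chain charge near neutral pH (histidine treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

def column_property(sequences, position, index=CHARGE):
    """Map one alignment column to numeric property values (0 if the residue is not in the index)."""
    return [index.get(seq[position], 0) for seq in sequences]

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy alignment: columns 1 and 3 (0-based) always carry opposite charges.
toy_alignment = ["AKGDL", "ADGKL", "AEGRL", "ARGEL"]
r_charge = pearson(column_property(toy_alignment, 1), column_property(toy_alignment, 3))
```

In this toy case the two columns are perfectly anti-correlated in charge, which is the signature one would look for under perfect conservation of local net charge.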

While most studies to date of compensatory mutation focus on highly-conserved
"core"-type regions of protein structures, Korber et al. (Covariation of mutations in the V3
loop of HIV-1: An information-theoretic analysis. Proc. Nat. Acad. Sci., 90, 1993) analyzed
the highly-variable V3 loop of the HIV-1 envelope protein. The researchers performed
robust bootstrapped estimates of the pairwise mutual information for all column-pairs from a
set of 31 columns, representing V3 residues. They found a set of about seven pairs that
showed considerable and statistically-significant mutual information, and their analysis of the
particular attributes (amino acids) suggested a particular pattern of highly likely
compensatory mutations. Although the authors did not argue or provide evidence for any
particular properties or relationships being conserved, subsequent mutational analysis
experiments in the laboratory indicated functional linkage between some of the pairs of sites
with high mutual information. Because the V3 region is known to be both functionally and
immunologically important, the inventor of the instant application suggested that such
analyses might be important in HIV/AIDS vaccine design.

What Kind of Method is Needed? 

Clearly, several well-studied and effective methodologies exist for the comprehensive
modelling of protein sequence families. In each case, the mathematical machinery is in place
to handle and detect very local and low-order statistical structure in the data. In each case,
the difficulties with computational complexity and statistical estimation arise in the attempt
to account comprehensively for all possible non-local and higher-order interactions between
residues, i.e., columns in the aligned sequence data.

Easier progress in modelling can be made if one uses HMMs or density networks
in conjunction with a fast, heuristic preprocessor that focuses explicitly on the detection of
plausible non-local interactions while sacrificing a degree of precision in modelling these
interactions. Such a procedure is provided by the principles described herein.

a) HIV PROTEIN SEQUENCE ANALYSIS
Tests on an HIV Protein Database

The Los Alamos HIV database contains, among other things, the amino acid
sequences for the V3 loop region of the HIV envelope proteins. This region is known to
have functional and immunological significance, and the discovery of sets of sites linked by
evolutionary covariation might have important implications for understanding and preventing
HIV infection and replication.

An earlier and smaller version of the same database was used by Los Alamos
scientists in their analysis of pairwise mutual information between residues (columns).

Experiments were performed on an HIV dataset with the coincidence detection
procedure, over a set of different values for r and T. Tables of results are shown and
discussed below.

Results of Experiments on HIV Protein Database 

The aforementioned version of the HIV-V3 dataset was edited in order to focus on
the thirty-three residues considered most conserved and most structurally and functionally
important by the Los Alamos researchers. The dataset therefore consisted of M = 657 rows
(sequences) of N = 33 columns (residues). For the coincidence detection procedure, these
33 columns are transformed into N_A = 33 × 21 = 693 attributes, one for each of 21 possible
symbols in each column. As with the artificial datasets, a set of experiments with different
values of T and r was performed. Coincidence detection runs were done with T = 10,000 and
r = 5, 6, 7 and 10 respectively, with T = 100,000 and r = 7, and finally with T = 750,000 and
r = 7. The results are shown in Tables C.1 through C.9 below.
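
The roles of T and r can be made concrete with a rough sketch: T subsets of r sequences are drawn, and attribute combinations shared by every sequence in a subset are tallied. The function names, the restriction to pairs, the toy data and in particular the null model in `expected_count` are assumptions for illustration; the procedure's own estimate of the expected count is defined elsewhere in this document and may differ.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb

def to_attributes(sequence):
    """Turn one aligned sequence into a set of (symbol, column) attributes; columns are 1-based."""
    return {(symbol, column) for column, symbol in enumerate(sequence, start=1)}

def sample_coincidences(sequences, T, r, seed=0):
    """Draw T subsets of r sequences and tally attribute pairs shared by all r members."""
    rng = random.Random(seed)
    rows = [to_attributes(s) for s in sequences]
    observed = Counter()
    for _ in range(T):
        subset = rng.sample(rows, r)                    # r distinct rows, without replacement
        shared = set.intersection(*subset)              # attributes common to the whole subset
        ordered = sorted(shared, key=lambda attribute: attribute[1])
        for pair in combinations(ordered, 2):
            observed[pair] += 1
    return observed

def expected_count(sequences, cset, T, r):
    """Expected tally for cset under an assumed null model of independent attributes."""
    M = len(sequences)
    probability = 1.0
    for symbol, column in cset:
        n = sum(1 for s in sequences if s[column - 1] == symbol)
        probability *= comb(n, r) / comb(M, r)          # all r sampled rows carry the attribute
    return T * probability

# Hypothetical toy data: K at column 2 always co-occurs with D at column 3.
toy = ["AKD", "AKD", "GKD", "GRE", "ARE", "GRE"]
observed = sample_coincidences(toy, T=1000, r=2)
pair = (("K", 2), ("D", 3))
ratio = observed[pair] / expected_count(toy, pair, T=1000, r=2)
```

A ratio well above 1 flags the pair as occurring together more often than the assumed null model predicts. Higher-order csets could be tallied the same way by enumerating larger combinations of the shared attributes.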

Table C.1: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 5.

HIV Dataset.
T = 10,000, r = 5.

Rank  CSET           Observed  Expected     Prob.
1     Q17|D24        1012      632.553864   0.316056
2     R17|T21        901       610.770465   0.509734
3     R12|Q17        570       348.605833   0.675621
4     L13|W19|Q24    195       5.535741     0.750381
5     N4|K9|A21      226       74.167398    0.831582
6     V11|R12|T18    159       20.764346    0.858239
7     R12|T18        454       318.517747   0.863429
8     L13|K31        419       300.333903   0.893461

Table C.2: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 6.

HIV Dataset.
T = 10,000, r = 6.

Rank  CSET           Observed  Expected     Prob.
1     Q17|D24        1177      385.853329   0.030891
2     R17|T21        957       368.736702   0.146238
3     [illegible]    1047      577.583832   0.294000
4     S10|D24        859       424.457490   0.350274
5     [illegible]    656       224.743830   0.355855
6     [illegible]    628       283.191527   0.516585
7     [illegible]    563       234.477161   0.549033
8     H12|R17        760       434.274580   0.554644
9     A18|T21        560       315.973734   0.718330
10    I11|R17        861       627.014684   0.737741
11    L13|W19|Q24    230       5.365202     0.755529
12    A21|D24        619       405.487239   0.776262
13    N4|K9|A21      237       25.176801    0.779367
14    V11|R12|T18    220       15.841474    0.793296
15    L13|K31        462       267.211446   0.809942
16    G10|H12        324       157.554658   0.857348
17    M13|W15        245       84.760597    0.867059
18    Q17|K31        384       231.749746   0.879169
19    H12|R17|A18    147       8.219536     0.898526
20    N4|K9|H33      309       170.353419   0.898711

Table C.3: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 7.

HIV Dataset.
T = 10,000, r = 7.

Rank  CSET           Observed     Expected     Prob.
1     [illegible]    [illegible]  [illegible]  [illegible]
2     [illegible]    [illegible]  [illegible]  0.013558
3     [illegible]    [illegible]  [illegible]  [illegible]
4     [illegible]    [illegible]  [illegible]  [illegible]
5     [illegible]    [illegible]  [illegible]  0.122699
6     R12|T18        879          244.789294   0.193645
7     S10|D24        836          232.201517   0.225812
8     R12|Q17        720          140.866087   0.254370
9     I11|R17        808          360.719364   0.441944
10    H12|R17        659          253.717115   0.511491
11    R17|A18        720          361.819054   0.592356
12    A21|D24        554          236.085429   0.661974
13    R17|E24        452          138.843412   0.670137
14    L13|K31        537          231.137972   0.682602
15    L13|W19|Q24    292          5.055474     0.714573
16    A18|T21        442          165.231990   0.731502
17    A18|Q31|H33    480          209.122778   0.741198
18    M13|W15        355          88.975694    0.749122
19    N4|K9|H33      340          75.556215    0.751690
20    V11|R12        513          253.001684   0.758878

Table C.4: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 10.

HIV Dataset.
T = 10,000, r = 10.

Rank  CSET       Observed  Expected      Prob.
1     Q31|H33    3933      883.532458    0.000000
2     N4|K9      2898      251.248235    0.000001
3     S10|F19    2245      907.769718    0.027977
4     F19|G23    2660      1588.173503   0.100497
5     R12|T18    1155      142.229768    0.128554
6     K9|I11     1230      311.653160    0.185125
7     A18|H33    1720      990.576490    0.345032
8     K9|H33     1125      405.874883    0.355482
9     H12|A18    732       54.213558     0.399002
10    S10|G23    1492      856.152048    0.445479
11    N4|H33     1257      689.784961    0.525468
12    A18|Q31    1188      636.901303    0.544755
13    Q17|D24    571       42.938312     0.572525
14    V11|R12    670       143.659674    0.574607
15    I11|R17    562       61.788305     0.606274
16    N4|R17     992       498.586806    0.614520
17    R12|Q17    484       31.204991     0.663619
18    K31|Y33    578       130.131866    0.669535
19    R17|T21    479       39.372545     0.679400
20    S10|D24    451       34.199456     0.706491

Table C.5: The thirty most likely correlated attributes, as estimated by the coincidence
detection procedure, for the HIV dataset. These results were produced with parameter settings
T = 100,000 and r = 7.

HIV Dataset.
T = 100,000, r = 7.

Rank  CSET           Observed     Expected      Prob.
1     [illegible]    [illegible]  [illegible]   0.000000
2     [illegible]    [illegible]  [illegible]   0.000000
3     [illegible]    [illegible]  [illegible]   0.000000
4     [illegible]    [illegible]  [illegible]   0.000000
5     [illegible]    [illegible]  [illegible]   0.000000
6     [illegible]    [illegible]  [illegible]   0.000001
7     [illegible]    [illegible]  [illegible]   0.000001
8     [illegible]    7666         [illegible]   0.000009
9     [illegible]    [illegible]  [illegible]   [illegible]
10    A21|D24        6342         2360.854285   0.001550
11    H12|R17        6363         2537.171146   0.002543
12    R17|A18        7162         3618.190543   0.005941
13    R17|E24        4451         1388.434119   0.021747
14    A18|T21        4673         1652.319901   0.024130
15    V11|R12        5486         2530.016841   0.028256
16    L13|K31        5224         2311.379719   0.031348
17    N4|K9|H33      3519         755.562151    0.044291
18    A18|Q31|H33    4665         2091.227775   0.066951
19    L13|W19|Q24    2585         50.554739     0.072672
20    K17|Q31        5967         3574.032278   0.096592
21    M13|W15        3204         889.756945    0.112364
22    V11|R12|T18    2424         117.500168    0.114017
23    N4|A21         6209         4030.321314   0.144077
24    K31|Y33        4878         2773.817984   0.164117
25    Q17|K31        3440         1450.098718   0.198651
26    K9|A21         5614         3692.671816   0.221632
27    P19|D24        3998         2250.071839   0.287354
28    Q17|A21        4151         2414.536189   0.292077
29    G10|H12        2661         953.572593    0.304245
30    H12|E24        3018         1458.576938   0.370622

Table C.6: The first twenty-five of the fifty most likely correlated attributes, as estimated by
the coincidence detection procedure, for the HIV dataset. These results were produced with
parameter settings T = 750,000 and r = 7. Note the appearance, at this degree of sampling, of
several statistically significant higher-order features with k >= 3.

HIV Dataset.
T = 750,000, r = 7.

Rank  CSET           Observed     Expected       Prob.
0     A18|Q31|H33    36019        15684.208314   0.000000
1     A18|T21        33816        12392.399254   0.000000
2     A21|D24        45549        17706.407140   0.000000
3     H12|A18        86025        24619.776947   0.000000
4     H12|R17        48257        19028.783592   0.000000
5     I11|R17        64548        27053.952336   0.000000
6     L13|K31        [illegible]  17335.347894   0.000000
7     L13|W19|Q24    20184        379.160544     0.000000
8     M13|W15        23300        6673.177086    0.000000
9     N4|K9          162152       74737.922307   0.000000
10    N4|K9|H33      26376        5666.716129    0.000000
11    Q17|D24        86891        17162.233105   0.000000
12    Q31|H33        233190       86078.318611   0.000000
13    R12|Q17        53740        10564.956512   0.000000
14    R12|T18        62774        18359.197022   0.000000
15    R17|A18        [illegible]  [illegible]    [illegible]
21    V11|R12|T18    17628        881.251263     0.000000
22    K31|Y33        36346        20803.634880   0.000002
23    N4|A21         45441        30227.409858   0.000003
24    Q17|K31        25033        10875.740384   0.000018
25    G10|H12        20779        7151.794446    0.000041

Table C.7: Continuation of the fifty most likely correlated attributes, as estimated by the
coincidence detection procedure, for the HIV dataset: csets ranked 26 through 50. These
results were produced with parameter settings T = 750,000 and r = 7. Note the appearance,
at this degree of sampling, of several statistically significant higher-order features with k >= 3.

HIV Dataset.
T = 750,000, r = 7.
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47 
48 
49 
50 



H12|T21
Q17|Y33
L13|W19
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6529 138.997153 



0.072825 
0.074203 
0.092437 
Table C.8: The top thirty-five pairwise inter-column mutual information values for the HIV-V3
dataset, as estimated by our methodology as described in the main text.

Rank  Pair i|j     MI(c_i, c_j)  Std. Error
1     [illegible]  [illegible]   [illegible]
2     [illegible]  [illegible]   [illegible]
3     9|21         [illegible]   [illegible]
4     23|24        [illegible]   [illegible]
5     12|24        [illegible]   [illegible]
6     [illegible]  [illegible]   [illegible]
7     [illegible]  [illegible]   [illegible]
8     11|24        [illegible]   [illegible]
9     [illegible]  [illegible]   [illegible]
10    9|11         [illegible]   [illegible]
11    [illegible]  [illegible]   [illegible]
12    [illegible]  [illegible]   [illegible]
13    [illegible]  [illegible]   [illegible]
14    [illegible]  [illegible]   [illegible]
15    [illegible]  [illegible]   [illegible]
16    [illegible]  [illegible]   [illegible]
17    21|24        0.260366      0.0338395
18    11|23        0.260337      0.0323302
19    11|19        0.249877      0.0320634
20    10|24        0.248938      0.0325318
21    19|23        0.242185      0.032301
22    5|26         0.239395      0.0386373
23    9|19         0.238318      0.0331283
24    4|23         0.23359       0.0302795
25    24|25        0.222109      0.0358744
26    6|26         0.220371      0.0397722
27    4|26         0.220213      0.0333324
28    6|24         0.218815      0.0335123
29    9|12         0.214844      0.0280984
30    15|24        0.213921      0.0301834
31    10|12        0.2133        0.0306496
32    9|18         0.21078       0.031734
33    11|21        0.210155      0.0308121
34    11|12        0.209421      0.0294066
35    4|19         0.20911       0.0290533




Table C.9: The top seven pairwise inter-column mutual information values for the HIV-V3
dataset, as estimated by the Los Alamos group.

Tables C.1 through C.4 illustrate the most significant csets for each detected coincidence of
attributes, again measured by our procedure's estimate of P(Observed | Independence) for the
observed number of coincidences. As one might expect, a clean separation between "probably
correlated" and "probably uncorrelated" does not manifest itself at this comparatively low
degree of sampling for this real-world dataset. Results for r = 7 and r = 10 indicate more
significant discovered csets than those for r = 5 and r = 6. At these higher r values, one sees
the emergence of a few csets with "Prob" values less than 0.1: (Q@17, D@24), (N@4, K@9),
(H@12, A@18), (Q@31, H@33) and (S@10, F@19). All of these csets appear among the most
significant csets reported in the more intensive sampling runs (with T = 100,000 and
T = 750,000), with the notable exception of (S@10, F@19). This latter cset is discovered at
this low degree of sampling only in the r = 10 run, and does not appear in the more intensive
sampling runs shown, both of which used r = 7.

Table C.5 displays the results for T = 100,000 and r = 7, and here it is clear that
some separation of signal from noise is taking place amongst the set of HOFs, with
seventeen pairwise and three 3-ary correlations appearing within our Prob < 0.1 significance
level.

At T = 750,000, we have more statistically significant detection of almost fifty 2-ary,
3-ary and up through 6-ary attribute correlations, as shown in Tables C.6 and C.7.

In order to get a better sense of the possible meanings of these results, let us consider
these inter-attribute correlations along with some inter-column correlations in the form of
pairwise mutual information estimates performed in our own analysis and also by the Los
Alamos group. Table C.8 displays the highest estimated mutual information values amongst
all (N^2 - N)/2 = 528 pairs of columns from our 33-column dataset. The estimates were obtained
using a bootstrap-like procedure in which 1000 sample data subsets of m = 300 out of M = 657
sequences were drawn and run through the standard mutual information calculation. Reported
in the table are therefore the mean values over the resampling and the associated standard
error values. There is significant intersection between the set of column-pairs indicated by
the top cset values in Tables C.6 and C.7 and those indicated by the top mutual information
values in Table C.8. The correspondence between the two rankings is not perfect, for a few
reasons (besides noise and simple sampling error). First and foremost, while the
"suspiciousness" of a single joint-attribute combination certainly contributes to the mutual
information within the corresponding set of columns, the behaviour of the other symbols
appearing within the columns obviously can also have great effect. Second, we note again
the observed sensitivity of coincidence detection results to the choice of r.
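
The resampled mutual information estimate described above might be sketched as follows. Several details are assumptions, since the text does not fix them: subsets are drawn without replacement, mutual information is computed from plug-in frequencies in bits, and the reported spread is the sample standard deviation of the resampled estimates. The function names are illustrative.

```python
import math
import random
from collections import Counter

def mutual_information(col_a, col_b):
    """Plug-in mutual information (in bits) between two equally long symbol columns."""
    n = len(col_a)
    joint = Counter(zip(col_a, col_b))
    count_a, count_b = Counter(col_a), Counter(col_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p_ab * log2(p_ab / (p_a * p_b)), written in terms of raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi

def resampled_mi(sequences, i, j, m=300, n_resamples=1000, seed=0):
    """Mean and spread of MI between columns i and j over random subsets of m sequences."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        subset = rng.sample(sequences, m)               # m of the M rows, without replacement
        estimates.append(mutual_information([s[i] for s in subset], [s[j] for s in subset]))
    mean = sum(estimates) / n_resamples
    spread = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_resamples - 1))
    return mean, spread
```

Applying `resampled_mi` to every one of the 528 column pairs and sorting by the mean would give a ranking of the same general form as Table C.8, though not necessarily the same numbers.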

Table C.9 lists the highest statistically significant mutual information values as
estimated by the Los Alamos group. We note the overlap between their list and ours, but
we emphasize again that group's use of an earlier, smaller, and perhaps otherwise different
database to which we did not have access.

Application of the coincidence detection method of the invention to biological data such as
these aligned HIV sequences thus leads to identification of covarying structural elements
that were previously unrecognized. The statistically significant coincidence of particular
structural elements, such as amino acid residues, likely indicates a biological role for a motif
comprising the covarying elements, as structure and function are tightly linked in
biochemical systems. One such example from the above application of the invention is the
statistically significant coincidence of residues A18, Q31 and H33 in the V3 loop of HIV
envelope protein. These residues are expected to contribute to a structural motif of the V3
loop that plays a biological role in the HIV life cycle. Such new information about
A18/Q31/H33, which prior to the invention had never been grouped together for a
particular biological role, may be exploited in various ways, as follows.

A peptide or peptidomimetic mimicking the afore-mentioned structural motif of the 
V3 loop (or another protein motif identified by the coincidence detection method) is 
provided by the invention. For the chosen example, the peptide or peptidomimetic would 
include spatial coordinates of amino acid residues A18/Q31/H33, though every atom of these 
amino acids would not necessarily be required. Rather, the peptide or peptidomimetic 
would have such spatial coordinates of A18/Q31/H33, as well as topological and 
electrostatic attributes, that would make it useful for a biological function, such as, for 
example competing with the actual V3 loop of HIV for binding to another biological 
molecule, where such binding of V3 would employ the structural motif that is mimicked by 
the peptide or peptidomimetic. 

Alternatively, a peptide or peptidomimetic which is designed based on covarying
k-tuples discovered by the coincidence detection method could be used as an antigen. That is,
the biological function which the molecule mimics is eliciting an immune response in an
animal. Similarly, vaccines embodying the covarying k-tuples described herein are also
encompassed by the invention.

Morgan and co-workers (Morgan et al. 1989. In Annual Reports in Medicinal
Chemistry. Ed.: Vinick, F.J. Academic Press, San Diego, CA, pp. 243-252.) define peptide
mimetics as "structures which serve as appropriate substitutes for peptides in interactions
with receptors and enzymes. The mimetic must possess not only affinity but also efficacy
and substrate function." For purposes of this disclosure, the terms "peptide mimetic" and
"peptidomimetic" are used interchangeably according to the above excerpted definition.
That is, a peptidomimetic exhibits function(s) of a particular peptide, without restriction of
structure. Peptidomimetics of the invention, e.g., analogues of the structural motif of the V3
loop posited above, may include amino acid residues or other chemical moieties which
provide the desired functional characteristics.

The invention further provides a ligand that interacts with a protein having a
structural motif identified using the coincidence detection method of the invention, as well as
a pharmaceutical composition including the ligand and a pharmaceutically acceptable carrier
or excipient therefor. The ligand would include chemical moieties of suitable identity,
spatially located relative to each other so that the moieties interact with corresponding
residues or portions of the motif. By interacting with the motif, the ligand could interfere
with the function of that region of the protein including the motif.

Thus, the invention provides a pharmaceutical composition for interacting with an
envelope protein of human immunodeficiency virus (HIV), including a ligand having a
functional group that interacts with the structural motif of the V3 loop which has spatial
coordinates of residues A18/Q31/H33, and a pharmaceutically acceptable carrier or
excipient therefor. The ligand may have more than one functional group that interacts with
the motif, such as, for example, a first functional group capable of binding to and being
present in an effective position in the ligand to bind to residue 18, a second functional group
capable of binding to and being present in an effective position in the ligand to bind to
residue 31, and a third functional group capable of binding to and being present in an
effective position in said ligand to bind to residue 33.

The invention further provides a method of designing a ligand to interact with a
structural motif of a protein, such as, for example, envelope protein of human
immunodeficiency virus (HIV). For example, in the case where the motif is the potentially
interesting A18/Q31/H33 motif identified by the coincidence detection method discussed
above, the method of designing includes the steps of providing a template having spatial
coordinates of residues A18, Q31 and H33 in the V3 loop of HIV envelope protein, and
computationally evolving a chemical ligand using an effective algorithm with spatial
constraints, so that the evolved ligand includes at least one effective functional group that
binds to the motif. The template provided may further include topological and/or
electrostatic attributes, and the effective algorithm may include topological and/or
electrostatic constraints. Similar method steps would be employed for other proteins
comprising a motif identified by the coincidence detection method.

The invention further provides a method of identifying a ligand to bind with a
structural motif of a protein. The structural motif is preferably identified by the coincidence
detection method. For example, in the case where the motif is that identified by the
coincidence detection method comprising residues A18, Q31 and H33 of HIV envelope
protein discussed above, the method includes the steps of: providing a template having
spatial coordinates of A18, Q31 and H33 in the V3 loop of HIV envelope protein, providing
a data base containing structure and orientation of molecules, and screening the molecules in
the data base to determine if they contain effective moieties spaced relative to each other so
that the moieties interact with the motif. The data base may further contain topological
and/or electrostatic attributes of the molecules, and the screening step may further include
determining if the moieties are effective in such regard for interacting with the motif. For
example, a molecule described in the data base may have such physical/chemical attributes
that it includes a first moiety that interacts with residue 18, a second moiety that interacts
with residue 31 and a third moiety that interacts with residue 33. Similar method steps
would be employed for other proteins comprising a structural motif of interest.
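
The spacing part of such a screening step can be sketched very roughly as a distance-matching test. Everything concrete here is hypothetical: the coordinates are made up, the tolerance is arbitrary, and the sketch checks only whether three moieties in a molecule reproduce the inter-point distances of a three-point template. It ignores orientation, chemistry, and the topological and electrostatic attributes mentioned above.

```python
import math
from itertools import permutations

def pairwise_distances(points):
    """Distances between points 0-1, 0-2 and 1-2 of a three-point set."""
    return [math.dist(points[a], points[b]) for a, b in ((0, 1), (0, 2), (1, 2))]

def matches_template(template_points, moiety_points, tolerance=1.0):
    """True if some ordered choice of three moieties reproduces the template spacing."""
    target = pairwise_distances(template_points)
    for candidate in permutations(moiety_points, 3):
        spacing = pairwise_distances(candidate)
        if all(abs(s - t) <= tolerance for s, t in zip(spacing, target)):
            return True
    return False

# Made-up coordinates (angstroms) standing in for three template residues and two database entries.
template = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
database = {
    "mol_1": [(1.0, 1.0, 0.0), (7.0, 1.0, 0.0), (1.0, 9.0, 0.0), (3.0, 3.0, 3.0)],
    "mol_2": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
}
hits = [name for name, moieties in database.items() if matches_template(template, moieties)]
```

Here only `mol_1` contains three moieties with the template's spacing, so it would be passed on to the more detailed checks the text describes.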

Where a ligand provided by the invention is included in a pharmaceutical
composition, the pharmaceutical composition further includes a pharmaceutically acceptable
carrier as is known to persons skilled in the art relating to pharmaceutical compositions.
The term "pharmaceutically acceptable carrier" as used herein includes diluents such as saline
and aqueous buffer solutions and vehicles of solid, liquid or gas phase, as well as carriers
such as liposomes (Strejan et al. 1984. J. Neuroimmunol 7:27), and dispersing agents such
as glycerol, liquid polyethylene glycols, and the like. The pharmaceutical composition may
include any of the solvents, dispersion media, coatings, stability enhancers, antibacterial and
antifungal agents (for example, parabens, chlorobutanol, phenol, ascorbic acid, thimerosal),
isotonic agents (for example, sodium chloride, sugars, polyalcohols such as mannitol) and
absorption delaying agents (for example, aluminum monostearate and gelatin) which are
known in the art.

Alternatively, a ligand provided by the invention, such as a ligand which binds to a
biological target, may be employed for diagnostic purposes. A diagnostic agent according to
the invention may include a ligand that interacts with a protein having a structural motif
identified using the coincidence detection method, and a detectable label linked to the ligand.
The detectable label may be any detectable substance known in the art, such as, for example,
a fluorescent substance or a radioactive substance. Alternatively, the label may be an
enzyme (such as, for example, horseradish peroxidase or alkaline phosphatase) which
catalyzes a reaction having a detectable (e.g., colored) product, or the label may be the
substrate for such an enzyme.

Application of the Principles Described to Drug Discovery Background: 

The multi-billion dollar pharmaceutical industry is based in large part on the design
or discovery and refinement of small molecules ("ligands") that interact with larger
molecules ("targets") and in some way repress, enhance, block, accelerate or otherwise
modify the structure, function or activity of the target. It is the structure, function or
activity of the target that is in some way implicated in some mechanism of disease. The
target molecule is often an enzyme or protein receptor or nucleic acid or some combination
thereof. There are a great number of possible ligands, and only relatively few of
them are developed and marketed as therapeutic compounds that work with or against
one or more targets and thus are effective against disease.

It is therefore of great interest to biotechnology and pharmaceutical researchers to be
able to consider a huge number of potentially useful compounds, but to avoid spending too
many resources developing therapies based on compounds that may turn out not to be
useful, safe, effective, and economically viable. The methods described herein can be used
to enhance and accelerate the process of discovering good, effective compounds and of
distinguishing the promising compounds from the unpromising or less promising compounds
in a public or private collection of molecules or their computer database representations.
They can be used effectively and contribute value in this application in many ways, by
helping to understand and infer target structures and by finding ligands whose geometric,
topological, electrostatic or other features make them likely candidates for effective
interaction with the targets.

Application of the Principles Described Herein to Databases of Molecules and
their Features

One way to represent a large number of molecular structures within a computer 
database (whether stored in main memory, on magnetic disk, tape, or other electronic or 
optical media) is in terms of "screens". Persons skilled in the art will recognize screens as
binary attributes wherein a given screen, or attribute, represents the presence or absence of a
particular substructure pattern, for example, a sulfate group. If a set of compounds is
represented with screens, then a particular compound, which we will denote by C, can be
represented by a string of 1s and 0s wherein the 1s stand for those pre-defined substructure
patterns that C contains and the 0s stand for those of the pre-defined substructure patterns
that C does not contain.
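
The screen representation can be illustrated with a minimal Python sketch. The screen list and the compound below are hypothetical examples chosen for illustration; an actual system would use its own predefined substructure patterns.

```python
# Hypothetical screen list; a real database would define its own patterns.
SCREENS = ["sulfate group", "carboxyl group", "benzene ring", "hydroxyl group"]

def screen_string(substructures):
    """Return a string of 1s and 0s, one character per screen in SCREENS."""
    present = set(substructures)
    return "".join("1" if s in present else "0" for s in SCREENS)

# Substructure patterns assumed to be found in compound C.
compound_c = {"sulfate group", "benzene ring"}
print(screen_string(compound_c))  # 1010
```

Each position in the string answers one yes/no question about C, so the string can be used directly as one row of a binary data matrix.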

This scheme can be extended to the representation of the primary structure of a 
nucleic acid or protein in terms of attributes, as discussed elsewhere herein. The primary 
structure is also known as the "sequence", that is, a sequence of bases, or nucleotides, in 
DNA or RNA, and a sequence of amino acids, also called amino acid residues, in a protein. 

It is simple to represent a protein sequence, for example, as a sequence of symbols, each
symbol being a letter of the alphabet corresponding to one of the twenty standard
naturally occurring amino acids. It is also simple to transform this representation by
representing each residue, or position, in the sequence by a set of twenty binary attributes,
if such a representation is desired. The attributes act like the screens described above. For
example, if the first amino acid in protein P is an alanine, represented by A, it can also be
represented by a value of "1" in the attribute that stands for the question, "Is the amino acid
in position 1 an alanine?", and by values of "0" for the attributes representing "Is the amino
acid in position 1 a cysteine?", "Is it a phenylalanine?", and so on. Figure 15 provides an
illustration of amino acids and residue positions.
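
As a sketch of this transformation, assuming the conventional one-letter amino acid codes and an arbitrary (alphabetical) ordering of the twenty attributes per position, a sequence can be expanded as follows. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the twenty standard residues

def one_hot_sequence(sequence):
    """Encode each position as twenty binary attributes ("Is residue i an X?")."""
    attributes = []
    for residue in sequence:
        attributes.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return attributes

encoded = one_hot_sequence("AC")
print(encoded[:20])  # position 1: only the alanine attribute is 1
print(encoded[20:])  # position 2: only the cysteine attribute is 1
```

A sequence of length L therefore becomes a row of 20 x L binary attributes, exactly one of which is 1 for each position.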
It is also easy and sensible to represent other aspects or features of the compounds in
terms of attributes. For example, a given compound C may be known to be active against a
particular target T, in which case an attribute corresponding to the question "Active against
T?" would have the value 1 for the object corresponding to compound C. For another
example, a pharmaceutical company may have run a number of compounds through a set of
"assays", or tests of biological or chemical activity. An assay might test for some aspect of
effectiveness against a target, or for ability to cross the blood-brain barrier, or for toxicity,
for example. Assay results can be represented in terms of discrete-valued, and even binary,
attributes as well, via preprocessing routines known to persons skilled in the art. Other
features of particular compounds can include literature citations (that is, references to papers
or studies in which the compound was described, designed, discovered or analyzed), and
ownership or patent status of the compound.

Not only can small therapeutic compounds be represented in terms of screens and 
other attributes, but so can larger potentially therapeutic molecules such as DNA, RNA, 
peptides, proteins, carbohydrates and lipids. Target molecules can also be represented in 
this way. All that is required is a predefined (though possibly updated, changing, shrinking 
or growing) list of substructural patterns or other features deemed important by the 
researchers or users. For target structures, one might want to represent substructural 
patterns as well as their 1-dimensional linear structures ("sequence"), genetic linkage
information, interactions with other proteins in disease pathways, literature citations, and so 
on. Sometimes a particular molecule might be listed as more than one object in a database, 
the different objects representing different conformations that the molecule can take. 

Clearly, this use of screens and other attributes in representing compound databases 
can also be represented in terms of the M by N data matrix we have used to describe the 
working of the invention. The M by N data matrix is illustrated below in Table 1.

The rows in Table 1 correspond to a set of molecules, compounds, molecular 
structures or sequences, while the columns correspond to features that may include 
substructural patterns, assay results or other aspects of the molecules. The value in table 
cell[i, j] is one (1) if molecule i has feature j and is zero (0) otherwise.

              Feature 1   Feature 2   ...   Feature N
Molecule 1        1           0       ...       1
Molecule 2        0           1       ...       0
...              ...         ...      ...      ...
Molecule M        0           0       ...       1

Table 1



Steps involved in applying the methods described herein to the analysis of a molecular 
database include: 

1. Obtain a molecular database that supports discrete attribute representation for the 1D,
2D and/or 3D molecular structures of interest (or, obtain a molecular database and use
standard methods to produce such a representation); also use standard methods to
transform sequence and other information about molecules of interest into attribute
representations.

2. Present this database, in whole or part, to an embodiment of the current invention 
such that each compound in the database corresponds to one or more of the M 
objects (rows) in the embodiment's data matrix and so that each screen-represented 
substructure pattern corresponds to an attribute (column) of the data matrix. The 
additional attributes representing activity, assay results, known targets against which
the compound has been used, source or means of production or storage of the 
compound, ownership or patent status of the compound, and so on, plus the 
substructure pattern attributes together comprise the N attributes (columns) in the 
data matrix. 

3. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer, or

• A rule-generator preprocessor for a rule-based system, or

• A report for users, researchers or managers, or a report-generation system, or

• Another computer program that performs some kind of further analysis of the
compounds, sequences, or structures represented in the database, or

• Another computer program that performs some transformation or optimization
on the database, or

• Another computer program that directs humans and/or robots in drug screening
experiments or in design, refinement or production of therapeutic compounds.

The output of the current invention, in this drug discovery application, can be useful 
in many possible ways. 

First, it can be used in setting up or optimizing a screen-based representation of 
molecules. For example, it is known in the art that a good screen-based representation
should use a set of screens (attributes) that are mutually uncorrelated and roughly
equiprobable. The method of the current invention would produce, when used as described
above, sets of correlated screens; this information can be used to add, remove, or combine
the features that the screens represent, in order to make the modified set of screens closer to
the ideal of uncorrelated and equiprobable.
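
To make the "uncorrelated and roughly equiprobable" criterion concrete, the following Python sketch computes the frequency of each screen and a pairwise correlation between two screens. The phi coefficient used here is a standard pairwise statistic substituted purely for illustration; it is not the k-tuple measure of the method described herein, and the small matrix is invented.

```python
from math import sqrt

def column_frequencies(matrix):
    """Fraction of objects (rows) in which each binary attribute is 1."""
    m = len(matrix)
    return [sum(row[j] for row in matrix) / m for j in range(len(matrix[0]))]

def phi(matrix, a, b):
    """Phi coefficient between binary columns a and b (0 means uncorrelated)."""
    m = len(matrix)
    pa = sum(row[a] for row in matrix) / m
    pb = sum(row[b] for row in matrix) / m
    pab = sum(row[a] * row[b] for row in matrix) / m
    denom = sqrt(pa * (1 - pa) * pb * (1 - pb))
    return (pab - pa * pb) / denom if denom else 0.0

# Invented example: screens 0 and 1 always co-occur, screen 2 is unrelated to screen 0.
matrix = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
print(column_frequencies(matrix))  # [0.5, 0.5, 0.5]
print(phi(matrix, 0, 1))           # 1.0
print(phi(matrix, 0, 2))           # 0.0
```

In this example all three screens are equiprobable, but screens 0 and 1 are redundant and could be combined into one screen.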

Other useful and valuable aspects of the information produced by the method include 
the following. 

For example, it is not uncommon for a pharmaceutical company to have good "lead 
compounds" that work in in vivo or in vitro experiments even when the researchers do not 
know the target structure, the active site on the target structure, or even which of several
proteins in the biological system is the target. If the methods described herein are used to 
discover correlations among substructural patterns and assay results, this information can aid 
in inferring a target structure and designing even more effective lead compounds, because it 
allows researchers to associate structure with desired activity. 

Another example is that of finding correlated amino acid residues in that part of a
drug discovery database corresponding to an aligned set of DNA, RNA or protein
sequences, as discussed later herein. In this case, some of the correlated k-tuples of residues
(positions) may correspond to evolutionarily conserved structural and functional
relationships. Therefore the principles described herein can in this way be used to help 
predict or solve the structure and function of important biological macromolecules, including 
pharmaceutical targets such as receptors and enzymes. 

Another example is to find correlations between structural, functional, disease
pathway or other aspects of one target molecule, T1, and another target molecule, T2; or
to find correlations between structural, functional or other aspects of a set of potential
therapeutic compounds aimed at T1 and those of a set of potential therapeutic compounds
aimed at T2. In either case, this correlation information is useful because it allows drug
designers to apply knowledge, compounds and techniques effective against T1 to the effort
against T2.

Another rather different application of the principles described herein to drug
discovery and medical science is obtained by considering the transpose of the data matrix
described above. Instead of compounds as objects (rows) and features of the compounds as
attributes (columns), consider what is possible when the compounds correspond to columns
and their features correspond to rows. See Table 2 below. Use of the current invention in
this scenario produces correlated k-tuples of compounds in feature space. These produced
k-tuples can embody several kinds of valuable information. For example, if the features in
the rows represent mostly substructural patterns (screens), then the produced k-tuples
correspond to clusters of compounds. Such clustering of compound databases is very useful
in high-throughput screening (HTS), with both biological/chemical assays (in vitro or in
vivo) and computational assays. In HTS, it is useful and economical to assay only one or a
few members of each cluster of compounds initially; then, only in the cases where a "hit" 
occurs (that is, a compound "passes" the "test" in the assay of biological or chemical 
activity) do other members of the corresponding cluster get sent through the assay. 

Use of the method on the "transpose" of the molecular database shown earlier, in 
order to cluster the compounds in feature-space is shown in Table 2. It is now the columns 
that correspond to a set of molecules, compounds, molecular structures or sequences, while 
the rows correspond to features that may include substructural patterns, assay results or
other aspects of the molecules. There are M' rows and N' columns, where perhaps M'=N
and N'=M, for the original M and N described above. The value in table cell[j, i] is one (1) if
molecule i has feature j and is zero (0) otherwise.

              Molecule 1   Molecule 2   ...   Molecule N'
Feature 1         1            0        ...       1
Feature 2         0            1        ...       0
...              ...          ...       ...      ...
Feature M'        0            0        ...       1

Table 2
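
The relationship between Table 1 and Table 2 is an ordinary matrix transpose. A minimal Python sketch, with a hypothetical helper name and invented values, is:

```python
def transpose(matrix):
    """Swap rows and columns: cell[i][j] of the input becomes cell[j][i]."""
    return [list(column) for column in zip(*matrix)]

# Invented example with M = 2 molecules (rows) and N = 3 features (columns).
molecules_by_features = [
    [1, 0, 1],
    [0, 1, 0],
]
# After transposition there are M' = 3 feature rows and N' = 2 molecule columns.
features_by_molecules = transpose(molecules_by_features)
print(features_by_molecules)  # [[1, 0], [0, 1], [1, 0]]
```

Running the same correlation method on the transposed matrix then treats molecules as the attributes, so the reported k-tuples are groups of molecules rather than groups of features.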



Application of the Principles Described Herein to Discover and Analyze
Genetic Networks

Advanced molecular biological and computational techniques applied in large-scale 
genome mapping and sequencing efforts are beginning to give us access to the sequences of 
complete genomes, the complete expression patterns of genes, and the ability to store and 
manipulate this information. Such information can be used to accelerate the discovery of 
new disease targets and successful therapeutic compounds. It is known that the genes that
form the "blueprint" for particular physical traits and systems within an organism often act
together in complex ways. Genes interact in mutually regulatory ways, promoting,
repressing and otherwise modulating their own and each other's activation and expression.

Traditionally, molecular biology has focused on the study of individual genes in 
isolation. However, to understand complex biological phenomena like neural development
or oncogenesis, for example, it is necessary to study the expression patterns of tens or 
hundreds of genes in parallel, taking into account temporal patterns as well as anatomical 
patterns. Such analysis requires novel computational and statistical capabilities, such as 
those provided by the principles described herein.

While many variations are possible and can be envisioned by those in the art, a basic
scheme for employing the methods described herein in the analysis of genetic networks 
might include the following steps: 

Step 1: Select the genes of interest.

Step 2: Select the biological parameters by which to represent the status of a gene at a
particular time. Biological parameters can include: expression of a gene (concentration
levels of the associated mRNA or protein product), a particular status of a protein such as a
biologically relevant phosphorylation or any other post-translational modification, the
location of a given protein, or the presence or absence of a cofactor. For example, one can
use polymerase chain reaction (PCR) techniques to amplify, then use known methods to
detect mRNA levels for each gene, then normalize these by dividing by maximum expression
levels for each gene, and then quantize these continuously varying levels into a set of z
discrete levels that can be represented in the data matrix format described throughout this
document. It is also possible to use concentration levels of protein products as indicators of
gene activity and interactivity. The change, over timed observations, of concentrations of
proteins is governed mainly by three processes: direct regulation of protein synthesis from a
given gene by the protein products of other genes (including auto-regulation as a special
case); transport of molecules between cell nuclei; and decay of protein concentrations.

Step 3: Select a scheme for time-sampling the biological parameters of the genes in the
genetic system under analysis. At each appropriate time, use methods known in the art to
measure the selected biological parameters for the selected genes.

Step 4: Represent the selected genes in terms of the selected biological parameters, and
represent the measured values of the biological parameters as attributes in the data matrix.
Represent the time-samples (the instances of measurement of the biological parameters) as
rows in the data matrix. That is, for a cell in the data matrix, in the ith row and jth column,
enter the quantity or feature measured in the ith time-sample for the jth biological parameter
(which may correspond to the jth gene, or it may not, depending upon whether one or more
parameters are measured for each gene). The recorded quantity, level or feature may be
binary (e.g., the gene is "on" or "off"), or may be one of z discrete values. As described
elsewhere in this document, any discrete-valued attribute can be represented by a binary
encoding of whether that value is absent or present in a given object, so that any of the
preferred embodiments of the current invention can be applied to data of this type.

Step 5: Employ the base method described above or one of the other embodiments 
described herein on the data matrix. 
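
The normalization, quantization and binary encoding mentioned in Steps 2 and 4 can be sketched in Python as follows. Equal-width binning into z levels is an assumption made here for simplicity (other quantization schemes are possible), and the function names and measurements are hypothetical.

```python
def normalize(levels):
    """Divide each measurement by the gene's maximum expression level."""
    peak = max(levels)
    return [x / peak for x in levels] if peak > 0 else [0.0] * len(levels)

def quantize(value, z):
    """Map a normalized value in [0, 1] to one of z discrete levels 0..z-1."""
    return min(int(value * z), z - 1)

def binary_encode(level, z):
    """Represent one z-valued attribute as z binary attributes."""
    return [1 if level == k else 0 for k in range(z)]

# Invented mRNA levels for one gene at three time points.
mrna = [2.0, 5.0, 10.0]
z = 3
levels = [quantize(v, z) for v in normalize(mrna)]
rows = [binary_encode(level, z) for level in levels]
print(levels)  # [0, 1, 2]
print(rows)    # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each entry of `rows` is the contribution of this gene to one time-sample row of the data matrix; the contributions of all genes would be concatenated to form the full row.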

The output of the above steps, that is, a set of k-tuples of correlated attributes, can 
be interpreted as a set of cliques of correlated genes. For example, one might discover that 
one gene is "on" whenever another gene is "on". Or one might discover that when one gene 
G1 is in "low expression", another gene G2 is "off"; when G1 is in "medium expression", G2
is in "low expression"; and when G1 is in "high expression", then G2 is in "medium
expression". Such a result might lend support to the hypothesis that G1 promotes the
expression of G2, or that "G1 turns G2 on". Similarly, correlated k-tuples of genes or
biological parameters might provide evidence that one gene represses, or "turns off", another
gene or set of genes, and so on. All such information can be useful in building a model, for
example a "boolean network", of a set of interacting genes. Such models are known to 
those in the art as providing valuable assistance in diagnosing, preventing and curing disease 
and in designing effective and economically valuable therapeutics. 

The rows in Table 3 correspond to a set of time-samples (a.k.a., time points, time- 
slices), that is, times or periods of observance of the activity of a particular gene or gene 
product. The columns correspond to particular genes or gene products. The value in table 
cell[i, j] is one (1) if gene j is considered "on", that is, e.g., "active" or "expressed", during
time i and is zero (0) otherwise. This representation and application is easily extended to
situations in which the simple on/off status of a gene is replaced by a set of z distinct levels 
of expression, for example, as measured by observed quantities of a gene's main protein 
product. It is also easily extended to situations in which more than one biological parameter 
is used to represent the status of a single gene.

            Gene 1   Gene 2   ...   Gene N
Time 1        1        0      ...     1
Time 2        0        1      ...     0
...          ...      ...     ...    ...
Time M        0        0      ...     1

Table 3

The methods described herein have been applied to a set of gene expression data for
genes involved in the development of spinal cord in rats, as described in (G.S. Michaels,
D.B. Carr, M. Askenazi, S. Fuhrman, X. Wen, and R. Somogyi, Pacific Symposium on
Biocomputing 3:42-53, 1998). The dataset is available from those authors and as of March,
1998 is also available over the world-wide web (WWW) at
http://rsb.info.nih.gov/mol-physiol/PNAS/GEMtable.html.

Using a reverse-transcriptase polymerase chain reaction (RT-PCR) protocol, the 
expression of 112 genes (mRNA levels, normalized by maximal expression level) was
assayed over nine developmental time points (E11, E13, E15, E18, E21, P0, P7, P14, and
P90 or adult, wherein E=embryonic, and P=postnatal). Included in the list of genes used are
genes considered important in CNS (Central Nervous System) development covering nine 
major gene families. 

The dataset mentioned above was easily transformed into a data matrix of objects 
and attributes, convenient for analysis with the methods described herein, in a few steps: 

1. The real-valued (that is, continuously-valued) gene expression levels were
transformed into a set of discrete values by use of a Bayesian clustering method
as embodied in the SNOB software, described in (C.S. Wallace and D.L. Dowe,
"Intrinsic Classification by MML - the SNOB program", Proceedings of the
Seventh Australian Joint Conference on Artificial Intelligence, pp.37-44, 1994).
Bayesian methods of quantizing or discretizing real numbers are well known to
persons skilled in the art. For convenience of interpreting output, these six
discrete numerical values were then further transformed into a small set of
alphabetic symbols, A through F.

2. A data matrix was set up such that the columns of the matrix correspond to the
112 different genes and such that the rows of the matrix correspond to the nine
different developmental time points.

The methods described herein were then run on the transformed gene dataset input, several 
times, each time using a different combination of values for the parameters r (sample size) 
and T (number of sampling iterations). The method can be applied to this dataset by use of 
a computer program very similar to the embodiment described in Appendices A and D;
however, that particular embodiment was tailored for application to the protein sequence
analysis domain, meaning that some of the parameter values were fixed to be appropriate for 
those particular trials on the HIV protein data. The program must be modified to allow for 
parameter values appropriate to the input data. 
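
To illustrate the roles of the parameters r and T, the following Python sketch repeatedly samples r rows from a binary data matrix, counts how often each pair of attributes is 1 in the same sampled row, and compares that observed count with the count expected if the two attributes were independent. This is a deliberately simplified, pairs-only illustration written for this discussion: it is not the embodiment of Appendices A and D, it reports an observed/expected ratio rather than a probability of independence, and all names and data are hypothetical.

```python
import random
from itertools import combinations

def sampled_pair_ratios(matrix, r, T, seed=0):
    """Return {(a, b): observed / expected co-occurrence count} for binary columns."""
    rng = random.Random(seed)
    m, n = len(matrix), len(matrix[0])
    freq = [sum(row[j] for row in matrix) / m for j in range(n)]
    observed = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(T):                      # T sampling iterations
        for row in rng.sample(matrix, r):   # r objects per sample
            on = [j for j in range(n) if row[j] == 1]
            for pair in combinations(on, 2):
                observed[pair] += 1
    ratios = {}
    for (a, b), count in observed.items():
        # Count expected over T * r sampled rows if a and b were independent.
        expected = T * r * freq[a] * freq[b]
        ratios[(a, b)] = count / expected if expected else 0.0
    return ratios

# Invented data: attributes 0 and 1 always co-occur; attribute 2 never occurs with them.
data = [[1, 1, 0]] * 10 + [[0, 0, 1]] * 10
ratios = sampled_pair_ratios(data, r=5, T=2000)
print(ratios)
```

A ratio well above 1 (here close to 2 for the pair (0, 1)) suggests positive correlation, while a ratio near 0 indicates attributes that avoid each other. Extending the sketch from pairs to k-tuples, and converting the comparison into a probability, is left to the embodiments described elsewhere in this document.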

These runs on the gene expression data were performed on an IBM PC-compatible 
computer under the Windows 95 operating system. For each run, a table of results was
printed out for viewing and analysis. The results of one run, for T=100,000 and r=5, are
attached as Appendix E. A researcher may wish to only print out the top 10, or 50, or 1000
(or any other number) most highly correlated k-tuples of genes. In Appendix E, the top 25 
are shown. 

In the attached results printout, the following format convention was used: 
Each group of one or more lines reports one correlated k-tuple of genes, that is, one 
cset (coincidence set) which displayed a low probability of its individual component 
attributes being statistically independent, as described elsewhere in this document. 
Low probability of independence is a form of high correlation, as known to persons 
skilled in the art and as explained earlier in this document. For each k-tuple, the k 
genes are shown, followed by a numerical value for their probability of 
independence. (This number often displays as zero, because the calculated value is 
so small, so close to zero, that the decimal expansion is truncated to zero). Again, 
a low probability value means a high degree of correlation. For each gene, the symbol in
A...F is shown, representing the quantized level of expression, followed by the
internal dataset name for the gene, followed by the more standard accepted name for
the gene.

The correlated k-tuples produced can be compared to the results reported by the 
authors in the aforementioned scientific paper. Among the analysis methods employed by 
those authors on this gene expression dataset was a pairwise mutual information analysis. In 
such analysis, a particular correlation measure, known as mutual information, was measured 
for each pair of the 112 genes, and the results were displayed graphically so that groups of
genes with mutually high mutual information tend to appear close to each other. The 
method described herein is able, as shown by the results in Appendix E, to discover not only 
highly-correlated pairs of genes, but also 3-tuples, 4-tuples, and so on. Examination of the 
results in Appendix E and the results of the authors of the previously cited scientific paper 
shows that the two different methods tend to corroborate each other but that the current 
method goes farther in finding correlations among large numbers of attributes. For example, 
an examination of any line of output of our results reveals a set of correlated genes such that 
the different pairs of genes in that set are usually also listed as having high pairwise mutual 
information by the other authors' method. 
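
For reference, pairwise mutual information between two discretized expression profiles can be computed as in the following Python sketch. This is the standard textbook definition, not code from the cited paper, and the example profiles are invented.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Invented on/off profiles over four time points.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]  # identical to gene_a
gene_c = [0, 1, 0, 1]  # statistically independent of gene_a
print(mutual_information(gene_a, gene_b))  # 1.0
print(mutual_information(gene_a, gene_c))  # 0.0
```

Because the measure takes exactly two sequences, a pairwise analysis of 112 genes evaluates it once for each of the 6,216 gene pairs and cannot, by itself, report a 3-tuple or 4-tuple.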

It is not always true that a correlated k-tuple of attributes implies that all possible pairs, 
from that k-tuple, are also mutually correlated, nor vice versa. Therefore, a method like 
those described herein, that can find pairwise and higher-order k-ary correlations, offers 
advantages over pairwise methods which can fail to detect important higher-order 
correlations among genes or among other attributes in other applications. 
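
A small constructed example shows how a higher-order correlation can exist with no pairwise correlation at all. In the Python sketch below, three binary attributes A, B and C are defined over four equally likely objects with C equal to A XOR B; the data are artificial and chosen only to make the point.

```python
from itertools import product

# Four equally likely objects: A and B take every combination, C = A XOR B.
objects = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def prob(columns):
    """Fraction of objects in which every listed attribute equals 1."""
    return sum(all(obj[c] for c in columns) for obj in objects) / len(objects)

for pair in [(0, 1), (0, 2), (1, 2)]:
    # Each pair looks independent: P(x=1, y=1) equals P(x=1) * P(y=1).
    assert prob(pair) == prob(pair[:1]) * prob(pair[1:])

# The triple is not independent: all three attributes are never 1 together,
# although independence would predict a probability of 1/8.
print(prob((0, 1, 2)), prob((0,)) * prob((1,)) * prob((2,)))  # 0.0 0.125
```

A purely pairwise method would report nothing for these three attributes, whereas a method that examines 3-tuples can detect that the value of any one attribute is fully determined by the other two.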



Application of the Principles Described Herein to the Discovery of Categories 
in Internet/Intranet Document Databases for Use in Document Search Engines 

Document search by topic or keyword implies the existence of an efficient search
engine and, indeed, much effort has been applied to the development of effective search
algorithms. This, however, only represents a part of the total solution - the problem also
requires an effective document categorization strategy. Information theory dictates that an
effective set of categories, or topics, used to organize documents should be uncorrelated and 
roughly equiprobable. When these topics occur with widely-varying probabilities, the search 
space of documents will be either too broadly or too narrowly divided by some topics. If
correlations exist between the topics (that is, where knowledge of the existence of a topic
within a given document implies a greater probability that other topics will be found within
the document as well) then the topic set can be reduced in size (by removing some of the
correlated topics from the categorization set). The "equiprobability" concern can be
addressed by the application of the principles described herein. This problem yields readily to
statistical techniques, but standard statistical techniques usually fail to capture higher-order
joint probability terms. The "decorrelation" problem is much more subtle and intractable. A
sub-optimal topic set forces the search engine to examine more such topics than necessary
before the results can be returned to the users (and may confuse interpretation of the
organization of the documents themselves). Given that every increment in search efficiency
allows greater numbers of users to use the system, the developers of such systems cannot
afford a lack of effective categorization of documents.

Application of the method to optimal or near-optimal topic set reduction can also be 
represented in terms of the M by N data matrix we have used to describe the working of the 
invention in other sections of this document. In one application-specific embodiment, the 
rows of the data matrix correspond to particular documents in the database; and the columns
correspond to a proposed topic set that is intended to categorize them. (See Table 6).

The rows in Table 6 correspond to documents in a database, while the columns 
correspond to proposed topics used to classify them. The value in table cell[i, j] is one (1) if
document i mentions topic j and is zero (0) otherwise.

              Topic 1   Topic 2   ...   Topic N
Document 1       1         0      ...      1
Document 2       0         1      ...      0
...             ...       ...     ...     ...
Document M       0         0      ...      1

Table 6

Steps involved in applying the current invention to a search for a near-optimal topic set
with which to classify a set of documents include:

1. Obtain an initial topic set. The field of document search is well established and
effective methodologies for the creation of such sets are known to those skilled in 
the art. 

2. Create the database using this topic set and the set of documents that the topic set 
categorizes. Given the topic set, all one need do is examine each document to 
determine whether or not it mentions each topic. 

3. Present this database, in whole or part, such that each document in the database
corresponds to one or more of the M objects (rows) in the embodiment's data matrix 
and so that each proposed topic corresponds to an attribute (column) of the data 
matrix. 

4. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 

5. Direct the discovered correlated k-tuples of attributes to: 

• A graphical viewer or printer, or 

• A rule-generator preprocessor for a rule-based system, or

• A report for administrators or other users of the computer database query 
system, or a report-generation system, or 

• Another computer program that performs some kind of further analysis of the 
data, for example, performing more in-depth statistical analysis (e.g., multiple 
regression) on the correlated variables, or 

• Another computer program that performs some transformation or 
optimization on the database. 
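
Step 2 of the list above (examining each document to determine whether it mentions each topic) can be sketched in Python as follows. Treating "mentions" as a case-insensitive substring match is an assumption made for this sketch, and the topics and documents are invented; a production system would likely use a more robust matching rule.

```python
def topic_matrix(documents, topics):
    """Row i, column j is 1 if document i mentions topic j (case-insensitive substring)."""
    return [
        [1 if topic.lower() in text.lower() else 0 for topic in topics]
        for text in documents
    ]

topics = ["wine", "sunshine", "cars"]  # hypothetical initial topic set
documents = [
    "California wine and sunshine",
    "Used cars for sale",
    "Sunshine forecast",
]
print(topic_matrix(documents, topics))
# [[1, 1, 0], [0, 0, 1], [0, 1, 0]]
```

The resulting list of rows has the same layout as Table 6 and can be presented to an embodiment as the M by N data matrix.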
Any statistically significant correlation between topics in the topic set may indicate
an ineffective initial choice of topics. The correlated k-tuples discovered by the method of
the current invention correspond both to "highly correlated topics" (with respect to the
"decorrelated topics" goal) and to "highly probable joint topics" (with respect to the
"roughly equiprobable topics" goal). A person skilled in the art can use the correlations
output in this application as a guide to determining which topic(s) found to co-occur should
be removed from, or combined within, the topic set. Using the output of the application in
this way would allow the administrator of such a document search engine to increase the
performance of the system by reducing the number of categories to be searched in response
to a user's query. The enhanced performance of the system would benefit the provider of the
service in two ways: the response time of the system to users' queries would decrease and
the total number of users that can be served would increase.



Applications of the Principles Described Herein to Internet and Intranet 
Search and Storage 

Internet and intranet search engines can be ranked subjectively by examining the
length of time needed for users to find sites or documents of relevance to their query. Any
improvement to the underlying algorithms that drive the search engine's output that allows
users to find what they're looking for sooner improves the usefulness of that engine, allows it
to serve more users and makes it more attractive to both the communities of users and
advertisers (in the case of internet search) and users and management (in the case of
company intranet search). Presented below are two uses of the principles described herein
that will provide ways to get relevant information to users sooner and to better manage the
storage of documents on internet or intranet search systems. In the descriptions and
examples below, the principles discussed apply equally whether one is considering the
internet/web and hence individual web pages and websites, or intranets, maintained within
the information systems of a single company or other institution, in which case the search is
for documents rather than websites per se.

For the purposes of elucidating this description, assume that each page in the set of 
web pages, or internal intranet documents in the set of such documents, known to the search
engine has already been classified by topic and that the set of topics is fixed a priori. The
goal is to present the user with the normal output of the search engine but to supplement
that list of links with an additional list of topics known to be related to the user's request.

The rows in Table 7 correspond to a set of web pages, or internal intranet
documents, while the columns correspond to topics. The value in table cell[i, j] is one (1) if
web page or document i mentions topic j and is zero (0) otherwise.





            Topic 1   Topic 2   ...   Topic N
Page 1         1         0      ...      1
Page 2         0         1      ...      0
...           ...       ...     ...     ...
Page M         0         0      ...      1

Table 7



Table 7 illustrates the database upon which the base method or other embodiment 
described herein will be run, in the data matrix format for representing objects and attributes 
that have been defined and described elsewhere herein. Note that, because of the 
characteristics of the embodiments described herein, the number of pages used in the table
need not be the entire set of all web pages. The embodiment, when run (or employed) on 
this table will find those topics that are frequently found in the same document together. 
This indicates that these topics are related in some fashion and, as the set of web pages 
supports their association, they may be of interest to the user as well. 
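
Once correlated k-tuples of topics have been found, supplementing a search result with related topics is a simple lookup. The following Python sketch assumes the k-tuples are available as a list of tuples of topic names; the function name and the example tuples are hypothetical.

```python
def related_topics(query_topic, correlated_tuples):
    """Collect topics that appear with query_topic in any correlated k-tuple."""
    related = set()
    for topic_tuple in correlated_tuples:
        if query_topic in topic_tuple:
            related.update(t for t in topic_tuple if t != query_topic)
    return sorted(related)

# Invented output of a run on a table laid out like Table 7.
correlated = [
    ("California", "Wine"),
    ("California", "Sunshine", "Cars"),
    ("Snow", "Skiing"),
]
print(related_topics("California", correlated))  # ['Cars', 'Sunshine', 'Wine']
```

The returned list would be shown alongside the normal list of links, so that the user's query for one topic also surfaces the topics found to co-occur with it.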

The advantages are several. The computational expense of these embodiments scales
linearly with respect to the number of columns in the database. In this application, the
number of columns represents the number of topics associated with web pages. As this
number is almost certainly very large, this characteristic of the method is a real benefit. In
addition, if the web pages are kept in random order, the embodiments can be run on more
manageable subsets of the entire set of web pages. This allows the job of finding these
associations to be divided into much smaller jobs which can be run, serially or in parallel,
during idle times on the server where the search engine resides. This method can produce
novel associations of great width (k) at any point during its execution. Many other
"association mining" methods only find longer k-tuples of associated attributes at later stages 
in their long execution times. Lastly, as the list of associated topics found by this algorithm 
grows, the pages that select the links for these new "joint topics" can be created and cached. 
This would reduce server loads (thus allowing more users to access the system). As this 
also puts bounds on the statistical relevance of the findings, this information could be used to 
select which new topic indices would be cached and which would be re-created as needed. 
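The idea of dividing the job into smaller jobs over randomly ordered pages can be sketched as follows. This is an illustrative assumption about how such a division might look, not the implementation described herein: the pages are shuffled, cut into fixed-size chunks, and pairwise co-occurrence counts from each chunk are accumulated. The helper names chunks, count_pairs and accumulate are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def chunks(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def count_pairs(rows):
    """Count, for each pair of columns, the rows in which both are 1."""
    counts = Counter()
    for row in rows:
        present = [j for j, value in enumerate(row) if value]
        counts.update(combinations(present, 2))
    return counts

def accumulate(rows, size, seed=0):
    """Shuffle the rows, process them chunk by chunk, and sum the counts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    total = Counter()
    for chunk in chunks(shuffled, size):
        total += count_pairs(chunk)  # each chunk could be a separate job
    return total

rows = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
chunked_total = accumulate(rows, size=2)
```

Because the per-chunk counts simply add, the chunks can be processed in any order, serially or in parallel, and partial totals are available at any time.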

Alternative Application of the Principles Described Herein to Manage the 
Storage and Retrieval of Web Pages and Documents: 

Internet and intranet search engines attempt to order the space of web pages or 
documents by topic. Generally, an initial (e.g. alphabetic) ordering is not at all likely to 
evenly divide that space. For example, the topic "California" will have a vastly greater set of 
pages associated with it than will "North Dakota". A simple tree-like storage of the pages 
by topic (with sub-topics at lower levels of the tree) will leave "California" with a very deep 
tree. What would be of use in this situation would be some better way to divide the search 
space of pages than by just single topics. In the noted example, it would be better to have 
the large set of California-related web pages divided into smaller sets closer to the size of the 
set for North Dakota. We can keep our ordering of the pages by topic if we choose to 
divide larger sets into smaller ones by replacing the single topic describing the set with a 
series of associated topic lists that encompasses the same space. Going back to our 
example, if "California" were only strongly associated with "Sunshine", "Wine" and "Cars" 
we would replace the tree node "California" with the set of nodes "California and Sunshine", 
"California and Wine", "California and Cars", "California and Other". This will allow faster 
lookup and storage of these pages because it reduces the height of this part of the tree (in 
this case) by one. Recursively applying the same technique at all nodes in the tree would 
provide a method for ensuring better balance than could have been had before. The only 
thing missing from this formulation of the new tree balancing function is the discovery of the 
associations themselves. An application of embodiments described herein to the same table 
discussed in the previous section extracts this information from the set of pages. The 
method tells us not only which topics are related but also gives an indication of the level of 
support for each association in the database. Once a problematically large topic has been 
identified, the list of associations found by the algorithm that includes this topic can be 
consulted to determine how to divide the topic. 
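The replacement of an oversized topic node by a set of "topic and associated topic" nodes can be sketched as below. This is a minimal illustration under stated assumptions: pages are represented as a mapping from a page identifier to its set of topics, the list of associated topics is taken as already discovered, and a page matching several associated topics is filed under each of them. The function name split_topic and the sample pages are hypothetical.

```python
def split_topic(pages, topic, associated):
    """Split the pages of `topic` into buckets keyed by associated topic.

    pages: dict mapping page id -> set of topic names.
    Returns a dict mapping node label -> list of page ids.
    """
    buckets = {f"{topic} and {a}": [] for a in associated}
    buckets[f"{topic} and Other"] = []
    for page_id, topics in sorted(pages.items()):
        if topic not in topics:
            continue
        matched = [a for a in associated if a in topics]
        if matched:
            for a in matched:
                buckets[f"{topic} and {a}"].append(page_id)
        else:
            buckets[f"{topic} and Other"].append(page_id)
    return buckets

sample_pages = {
    "p1": {"California", "Wine"},
    "p2": {"California", "Cars", "Wine"},
    "p3": {"California"},
    "p4": {"North Dakota"},
}
nodes = split_topic(sample_pages, "California", ["Sunshine", "Wine", "Cars"])
```

Whether a page should appear under every matching node or only the best-supported one is a design choice that the text leaves open; the sketch takes the former for simplicity.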

The use of tree-based storage and retrieval techniques is known to those in the art, and 
such methods include such variations as B-trees, k-D trees, tries, k-D tries, and gridfiles. 
Hashing schemes can also be used instead of, or in addition to, tree-based methods per se. 
With all such methods, there are efficiency gains to be made, in both storage (main memory 
and offline memory) and running time, by taking advantage of particular distributions of the 
data in the application domain. The embodiments described herein can, as shown above and 
in other ways, be used to obtain a better understanding of and exploitation of the distribution 
of the data. 



The advantages include all those listed for the first alternative above with one 
significant addition - if one is already using the method to find lists of sites related to a given 
query, then one is already compiling the exact list of associations that is needed here to help 
balance the search tree. 

Application of the Principles Described Herein to Sales Analysis, Direct Mail 
and Related Marketing Activities 

Marketing executives, within retail sales companies, advertising/marketing agencies, 
magazine, newspaper, radio, television, film and internet companies, and non-profit and 
charitable organizations, need to know which kinds of people are likely to buy or contribute. 
In all these and other marketing contexts, it is very useful and valuable to be able to analyze 
data both from previous marketing campaigns (we'll use the term "mailings", though other 
campaigns and promotions are also included) and from previous purchases of the relevant 
goods and services, or previous contributions to charities (let us refer to all these as 
"products"). 

It is useful for marketing executives, salespeople and management to know such 
things as, for example: 

Which products tend to be bought together (by the same customer, perhaps within the same 
transaction)? 

Which of our previous advertising campaigns or mailings produced good response 
(high sales of a product) and which did not? 

Which demographic factors correlated with large total spending on our company's 
products last year? Are 25-40 year old females in the Midwest region buying our 
products? 

Such questions can be addressed by the analysis of databases organized in terms of 
customers, transactions, demographic factors, previous marketing campaigns, and sales of 
particular products. For charitable organizations, the basic idea is the same, though instead 
of "sales" and "customers" the application is to "contributions" and "donors", for example. 
The principles described herein can be applied successfully to these analysis tasks, wherein 
one of the main current computational challenges is the discovery of associations 
(correlations) amongst sets of variables or attributes in very large databases. Table 8 
illustrates the application to the analysis of databases on customer purchases of products. 
Table 9 is similar except that it illustrates the case wherein not only purchases are recorded 
in the data, but also information on previous marketing campaigns. Either of these schemes 
may be augmented by the inclusion of additional columns corresponding to demographic 
attributes of the customers, for example region of residence, age group, income group, 
gender, occupational category, and participation in community- or leisure-related activities. 

The rows in Table 8 correspond to customers (and/or potential customers), while the 
columns correspond to products (goods or services) that were either purchased (denoted by 
1) or not purchased (denoted by 0) by particular customers. The value in table cell[i, j] is 
one (1) if customer i has purchased product j and is zero (0) otherwise. 





              Product 1   Product 2   ...   Product N
Customer 1        1           0       ...       1
Customer 2        0           1       ...       0
...
Customer M        0           0       ...       1

Table 8 



The rows in Table 9 correspond to customers (and/or potential customers), while the 
columns correspond to mailings (or other marketing campaigns) and products (goods or 
services) that were either purchased (denoted by 1) or not purchased (denoted by 0) by 
particular customers. For the Mailing columns, the value in table cell[i, j] is one (1) if 
customer i was sent mailing j and is zero (0) otherwise. For the product columns, the value 
in table cell[i, j] is one (1) if customer i has purchased product j and is zero (0) otherwise. 





              Mailing 1   ...   Mailing n1   Product 1   ...   Product n2
Customer 1        1
Customer 2
...
Customer M        0       ...       1            1       ...       0

Table 9 



Steps involved in applying the principles described herein to a sales/marketing 
database include: 

1. Obtain a sales/marketing database as described above. Where necessary, use methods 
known in the art to transform continuous-valued variables into discrete-state 
variables. 

2. Present this database, in whole or part, such that each customer in the database 
corresponds to one or more of the M objects (rows) in the embodiment's data matrix 
and so that each product or mailing corresponds to an attribute (column) of the data 
matrix. Mailing attributes (if any) plus product attributes together comprise the N 
attributes (columns) in the data matrix. 

3. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 

4. Direct the discovered correlated k-tuples of attributes to: 
• A graphical viewer or printer, or 

• A rule-generator preprocessor for a rule-based system, or 

• A report for marketing personnel, magazine/newspaper circulation directors, 
salespeople, managers or other users of the computer database query system, 
or a report-generation system, or 

• Another computer program that performs some kind of further analysis of the 
data, for example, performing more in-depth statistical analysis (e.g., multiple 
regression) on the correlated variables, or 

• Another computer program that performs some transformation or 
optimization on the database. 
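Step 2 of the list above, presenting the database as a data matrix, might look like the following sketch. It assumes, purely for illustration, that the raw data are available as sets of (customer, mailing) and (customer, product) pairs; the function name build_matrix, the column label prefixes, and the sample records are hypothetical and not taken from the source.

```python
def build_matrix(customers, mailings, products, sent, bought):
    """Build a binary data matrix in the layout of Table 9.

    sent:   set of (customer, mailing) pairs.
    bought: set of (customer, product) pairs.
    Returns (column_names, rows), one row per customer.
    """
    columns = [f"mailing:{m}" for m in mailings]
    columns += [f"product:{p}" for p in products]
    rows = []
    for c in customers:
        row = [1 if (c, m) in sent else 0 for m in mailings]
        row += [1 if (c, p) in bought else 0 for p in products]
        rows.append(row)
    return columns, rows

customers = ["alice", "bob"]
mailings = ["spring_catalog"]
products = ["tickets", "team_shirt"]
sent = {("alice", "spring_catalog")}
bought = {("alice", "tickets"), ("alice", "team_shirt"), ("bob", "tickets")}

columns, matrix = build_matrix(customers, mailings, products, sent, bought)
```

The resulting rows can then be handed to whichever embodiment is used in step 3; demographic attributes, once discretized, would simply be appended as further columns.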

The output in this application can be useful in several possible ways. 

For example, the output may include correlated k-tuples which comprise sets of 
products that tend to be bought together, either within the same transaction or by the same 
customer across different transactions. Such information can be used to develop "tie-in" and 
co-marketing campaigns, such as, for example, when buyers of NBA basketball tickets are 
given coupons for discounts on NBA team shirts, basketball shoes, and other basketball- 
related merchandise. While it is perhaps not surprising that basketball fans like to wear NBA 
team shirts, the steps described above are capable of discovering other associations between 
products that are not so obvious. 

For another example, the output may include correlated k-tuples which represent 
particular advertising campaigns correlated with particular product purchases. Such 
information can help marketing executives focus their resources on new marketing 
campaigns of the type most likely to increase sales. 

Use of the Principles Described Herein in Clustering Customer Data 

Another rather different application of the principles described herein to marketing 
practice is obtained by considering the transpose of the data matrix described above. 
Instead of customers as objects (rows) and products and demographic factors as attributes 
(columns), consider what is possible when the customers correspond to columns and the 
product and demographic variables correspond to rows. (See Table 10). Applying the 
principles described herein to this scenario produces correlated k-tuples of customers, or 
customer profiles, in the space of demographic and purchasing pattern features. This is seen 
to be a form of clustering of the customer data, into groups of customers or customer 
profiles that are roughly similar in terms of their buying habits and lifestyles. Such clustering 
can be useful in designating special "target groups", to enable more optimal allocation of 
marketing resources. Once this transposition of the data is envisioned, the other steps apply 
entirely analogously to the descriptions given above for marketing activities. 

Use of the method on the "transpose" of the marketing database shown earlier, in 
order to cluster the customers, is shown in Table 10. It is now the columns that correspond 
to a set of customers, while the rows now correspond to products purchased and 
demographic features. There are M' rows and N' columns, where perhaps M'=N and 
N'=M, for the original M and N described above. The value in table cell[j, i] is one (1) if 
customer i purchased product j or possesses demographic feature j and is zero (0) 
otherwise. 





                Customer 1   Customer 2   ...   Customer N'
Prod/Demo 1         1            0        ...        1
Prod/Demo 2         0            1        ...        0
...
Prod/Demo M'        0            0        ...        1

Table 10 
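The transposition behind Table 10 is mechanically simple, as the following sketch suggests. It assumes the customer-by-attribute matrix is held as a list of rows; the helper names transpose and shared_features are hypothetical. After transposing, a co-occurrence between two customer columns is just a product or demographic feature that both customers possess.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

def shared_features(transposed, a, b):
    """Count rows (features) in which customers a and b both have a 1."""
    return sum(1 for row in transposed if row[a] and row[b])

# Rows are customers, columns are products/demographic features.
by_customer = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
by_feature = transpose(by_customer)  # rows are now features
overlap_01 = shared_features(by_feature, 0, 1)
overlap_02 = shared_features(by_feature, 0, 2)
```

Customers with a high overlap relative to what chance would predict are candidates for the same cluster or customer profile.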

Application of the Principles Described Herein to the Analysis of Medical, 
Epidemiological and/or Public Health Databases 

Medical scientists and practitioners have long known that many human diseases and 
disorders, physical and mental, are caused by complex interactions among many potential 
contributing factors. Such factors can include particular genetic conditions or abnormalities, 
exposure to biological pathogens, aspects of diet, environment (air, water, noise pollution), 
exposure to hazards in the home or workplace, emotional stress, substance abuse and 
poverty, among others. The true "causes" of a given condition often remain impossible to 
ascertain, though there is much folklore and anecdotal evidence offered in attempts to 
explain some instances. The problem of discovery and prevention of health threats is helped 
in recent times by the ability of researchers, insurance company representatives, 
epidemiologists and public health officials to compile and analyze large amounts of data on 
real people, healthy and sick, living and deceased. As in other applications of computers 
and statistical analysis to databases, one must contend in this field with a huge number of 
variables and the exponential complexity of their potential interactions. This kind of analysis 
can be improved greatly by methods that efficiently find correlations and associations 
amongst tens, hundreds, or thousands of variables. The principles described herein are 
applicable to such a situation. 

Application to medical databases can also be represented in terms of the M by N data 
matrix we have used in other sections of this document. In one application-specific 
embodiment, the rows of the data matrix correspond to particular patients or subjects in a 
health study; and the columns correspond to factors thought to contribute to a given disease 
or set of diseases. Again, these factors can include socioeconomic factors, lifestyle 
(exercise, diet), aspects of the patient's home or workplace environment (e.g., exposure to 
carcinogenic chemicals), past medical treatments, and so on. (See Table 11). 

The rows in Table 11 correspond to patients or to human subjects in a study, while 
the columns correspond to potential disease factors. The value in table cell[i, j] is one (1) if 
patient i has experienced or been exposed to factor j and is zero (0) otherwise. 

             Factor 1   Factor 2   ...   Factor N
Patient 1        1          0      ...       1
Patient 2        0          1      ...       0
...
Patient M        0          0      ...       1

Table 11 



In some application-specific embodiments, there may be not just one disease 
represented implicitly, but, instead, a number of different diseases, represented as attributes 
along with the factors shown in Table 11 and described above. For example, a particular 
patient p may have lung cancer but not diabetes or heart disease, and so row p would have a 
1 in the column corresponding to lung cancer and have values of 0 for the columns 
corresponding to diabetes and heart disease. 

Steps involved in applying the current invention to a medical/epidemiological/lifestyle 
factors database include: 

1. Obtain a database of medical/epidemiological/lifestyle factors as described above. 
Where necessary, use methods known in the art to transform continuous-valued 
variables into discrete-state variables. 

2. Present this database, in whole or part, such that each patient/subject in the database 
corresponds to one or more of the M objects (rows) in the embodiment's data matrix 
and so that each potential disease factor corresponds to an attribute (column) of the 
data matrix. Additional attributes representing different diseases plus the disease 
factors together comprise the N attributes (columns) in the data matrix. 

3. Employ the base method or other embodiments described herein on the data matrix. 

4. Direct the discovered correlated k-tuples of attributes to: 
• A graphical viewer or printer, or 

• A rule-generator preprocessor for a rule-based system, or 

• A report for doctors, researchers, public health officials, managers or other 
users of the computer database query system, or a report-generation system, 
or 

• Another computer program that performs some kind of further analysis of the 
data, for example, performing more in-depth statistical analysis (e.g., multiple 
regression) on the correlated variables, or 

• Another computer program that performs some transformation or 
optimization on the database. 
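Step 1 of the list above calls for transforming continuous-valued variables into discrete-state variables by methods known in the art. One simple such method, shown here only as an assumed example, is binning against fixed cut points and then expanding the bin index into binary attribute columns. The function names discretize and one_hot are hypothetical, and the cut points are arbitrary illustrative numbers, not clinical thresholds.

```python
from bisect import bisect_right

def discretize(value, cut_points):
    """Return the index of the bin that `value` falls into.

    cut_points must be sorted ascending; a value equal to a cut point
    is placed in the higher bin.
    """
    return bisect_right(cut_points, value)

def one_hot(index, n_bins):
    """Expand a bin index into n_bins binary attribute values."""
    return [1 if i == index else 0 for i in range(n_bins)]

# Illustrative cut points only (three bins: low, middle, high).
cuts = [120, 140]
low = one_hot(discretize(118, cuts), 3)
middle = one_hot(discretize(120, cuts), 3)
high = one_hot(discretize(150, cuts), 3)
```

Each bin then becomes one binary column of the data matrix, so a continuous measurement contributes as many attributes as it has bins.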

The output of this application can be useful in several possible ways. 

For example, the output may include correlated k-tuples which comprise sets of 
factors associated with one or more disease conditions. Such information, perhaps refined 
through further statistical analysis, can provide breakthroughs in understanding, treating, and 
preventing those particular diseases. 

For another example, the output may include correlated k-tuples which comprise sets 
of factors associated with each other, such associations being previously unknown. The 
discovery of associated lifestyle factors, such as particular diets and obesity or particular 
professions and high levels of alcohol consumption, can itself be useful in improving public 
health policy and medical practice. 

All such discovered correlations can potentially be of great benefit to insurance 
providers, public or private, as they must make their actuarial tables and insurance policies 
reflect accurate predictions of health and life expectancy, for example, based on lifestyle, 
socioeconomic and other factors. 

Use of the Principles Described Herein in Clustering Patient Data 

Another rather different application of the principles described herein to public health 
and insurance policy and practice is obtained by considering the transpose of the data matrix 
described above. Instead of patients as objects (rows) and potential disease factors as 
attributes (columns), consider what is possible when the patients correspond to columns and 
the factors correspond to rows. (See Table 12). Use of the current invention in this 
scenario produces correlated k-tuples of patients, or patient-profiles, in feature-space. This 
is seen to be a form of clustering of the patient data, into groups of patients or patient 
profiles that are roughly similar in terms of their lifestyle factors. Such clustering can be 
useful in designating special "low-risk" or "high-risk" types of patients or insurance 
applicants, to enable more optimal allocation of health services, outreach programs, 
insurance protection, or other resources. Once this transposition of the data is envisioned, 
the other steps of the preceding application to analysis of medical and other databases apply 
entirely analogously to the descriptions given above. 

Use of the principles on the "transpose" of the disease factors database shown 
earlier, in order to cluster the patients or policy-holders in factor-space, is shown in Table 12. 
It is now the columns that correspond to a set of patients, medical study subjects, or 
potential insurance policy-holders, while the rows now correspond to potential disease 
factors that may include lifestyle factors, socioeconomic factors, workplace factors, and so 
on. There are M' rows and N' columns, where perhaps M'=N and N'=M, for the original M 
and N described above. The value in table cell[j, i] is one (1) if patient i possesses or has 
been exposed to factor j and is zero (0) otherwise. 





             Patient 1   Patient 2   ...   Patient N'
Factor 1         1           0       ...       1
Factor 2         0           1       ...       0
...
Factor M'        0           0       ...       1

Table 12 



Application of the Principles Described Herein to the Discovery of the Causes 
of Failures in Complex Systems 

Administrators of complex integrated systems such as computer networks and 
factory automation systems have been faced with the difficult diagnosis problems these 
systems pose since their inception. Where a series of events in the system (perhaps over a 
protracted period of time) leads to a failure of the system as a whole, the diagnosis of the 
true cause of the failure can be an almost insurmountable task. For example, a network 
interface card on a gateway computer that fails intermittently when under high load 
conditions may not cause the host computer to crash but may lead to errors on other 
computers that use the card (by proxy) to service their network requests. Such a problem 
would be difficult in the extreme to track down using conventional diagnosis techniques. 
Tools that can present administrators with a better analysis of the conditions on the system 
as a whole that lead to the failure would speed the diagnosis and correction of the 
underlying problem. 

We need to define the database upon which the principles described herein will be 
applied. 

The database as a whole can be thought of as a state record of a series of 
components over time. The columns of this database, when viewed in the data matrix 
format used throughout this document, represent the series of components; the rows 
represent discrete points in time. The values in the table are intended to be an encoding of 
each component's state (on, off, idle, error, and so on) at the time in question. Such logging 
procedures are well known to those skilled in the art. 

The rows in Table 13 correspond to points in time, while the columns correspond to 
individual components in the system. The value in table cell[i, j] is the encoded state of 
component j at time i. 





          Component 1   Component 2   ...   Component N
Time 1         1             0        ...        1
Time 2         0             1        ...        0
...
Time M         0             0        ...        1

Table 13 

Steps involved in applying the method of the current invention to analysis of a system 
operations database include: 

1. Create a database of system components and their states as described above. The 
choice of state sets for the components in the system will be driven by behaviors of 
interest to the administrators of the system as well as by the components themselves. 

2. Present this database, in whole or part, as a data matrix such that each column in the 
data matrix corresponds to a component in the system and each row in the data 
matrix corresponds to a point in time in the series. 

3. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 



4. Direct the discovered correlated k-tuples of attributes to: 



• A graphical viewer or printer, or 

• A rule-generator preprocessor for a rule-based system, or 

• A report for the administrators of the system, or a report-generation system, 
or 

• Another computer program that performs some kind of further analysis of the 
data, for example, performing more in-depth analysis on the correlated 
variables. 



The output in this application can be used to indicate the events in the system that 
are typically seen to co-occur with a given failure. Given the formulation of the database, 
we need not restrict ourselves to the states of the components in the system at the time of 
the failure - we can expand our examination of the failure conditions to any range of points 
in time for which the database has records. This allows the method to help illuminate subtle 
causal relationships between components that ultimately lead to failure. In the simplest case, 
the output can be used to eliminate some components in the system from scrutiny if it is seen 
that they are not correlated with the failure. 
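One way to expand the examination beyond the moment of failure, offered here as an assumption rather than as the procedure of any embodiment, is to append to each time row the component states recorded a fixed number of steps earlier. A co-occurrence within such an extended row then relates a current state (for example a failure) to an earlier state of another component. The function name add_lagged_columns and the sample log are hypothetical.

```python
def add_lagged_columns(rows, lag):
    """Extend each row at time t with the row recorded at time t - lag.

    rows: list of per-time-step state lists in the layout of Table 13.
    Returns rows for t >= lag only, each twice the original width:
    current states first, then the lagged states.
    """
    return [rows[t] + rows[t - lag] for t in range(lag, len(rows))]

# Hypothetical log: column 0 = interface-card error, column 1 = client failure.
log = [
    [1, 0],  # time 0: card error, no client failure
    [0, 0],  # time 1
    [0, 1],  # time 2: client failure
]
extended = add_lagged_columns(log, lag=1)
```

Several lags can be appended in the same way, at the cost of multiplying the number of columns, which is tolerable given that the expense is stated to scale linearly with the number of columns.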

Application of the Principles Described Herein to the Analysis of Complex Systems 

Complex systems define a large family of somewhat similar applications. For the purpose of 
this discussion, complex systems are defined as systems for which there are no direct 
detailed modeling approaches because these systems comprise a huge number of interacting 
individual components or parts. Examples would include (but would not be limited to) 
economics, individual human behavior, productivity in groups of employees, weather 
patterns, crime in a nation, etc. In each of these cases, there are no known methods to 
model the system exactly so variables or sets of variables are used to measure the state of 
these systems (examples in the case of economics would be the interest rate, stock market 
values and inflation rates). For the purposes of this description, the events in these complex 
systems take the form: pre-condition, action and post-condition. These interactions 
represent the state of the system before the actions were taken, the actions themselves and 
the resulting state of the system at some point after the implementation of the actions. Put 
another way, the set of previous perturbations of the system and their outcomes are used as 
a history of the system from which to derive information about the system's characteristics. 

The kinds of databases of complex systems that can effectively utilize the principles 
described herein must meet certain restrictions. There must be some set of variables (either 
in common usage or derivable from knowledge in the domain) used to measure the state of 
the given system. These variables are used in the pre- and post-condition parts of each 
database entry. Additionally, there must be some general set of actions that may be applied 
to the system that encompass methods by which it is known the system may be perturbed. 
Returning to the economics example, the action set would include all things under the 
heading of "fiscal policy". 

Formally, the database must include attributes representing zero or more pre-condition 
variables, zero or more action variables, and zero or more post-condition variables. Leaving 
aside the trivial case wherein the database contains zero pre- and post-condition variables and 
zero action variables, there are seven cases to consider. They will be presented exhaustively 
below with examples where appropriate. Note that in each case, there are two 
interpretations of relevance. For example, consider the case where we have pre-condition 
variables and action variables but no post-conditions. The correlations can be derived in two 
ways: the database itself could have had no post-condition variables in it (and the returned 
set of correlations is culled to remove any correlations that involved only variables of one 
type) or it can be that just the set of correlations themselves contain no post-condition 
variables even though the database does in fact contain them. For the purposes of the 
discussion, we assume the former is the case - we can always cull the results of the method 
on a database that has more types of variables to leave a set of correlations which do not 
have some types of variables. 
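The culling described above can be sketched as a simple filter over the returned k-tuples. The sketch assumes each attribute has been labeled with one of three types ("pre", "action", "post"); the function names types_present and cull, the default of requiring at least two types, and the weather-flavored sample attributes are all illustrative assumptions.

```python
def types_present(k_tuple, var_type):
    """Return the set of variable types appearing in a correlated k-tuple."""
    return frozenset(var_type[attribute] for attribute in k_tuple)

def cull(correlations, var_type, excluded=(), min_types=2):
    """Keep k-tuples that avoid the excluded types and mix enough types.

    excluded:  variable types that must not appear in a kept k-tuple.
    min_types: minimum number of distinct types a kept k-tuple must contain.
    """
    kept = []
    for k_tuple in correlations:
        present = types_present(k_tuple, var_type)
        if present & set(excluded):
            continue  # contains a type we want to leave out
        if len(present) < min_types:
            continue  # involves only variables of one type
        kept.append(k_tuple)
    return kept

var_type = {
    "rain": "pre",
    "low_pressure": "pre",
    "evacuate": "action",
    "flooding": "post",
}
found = [
    ("rain", "low_pressure"),
    ("rain", "evacuate"),
    ("rain", "evacuate", "flooding"),
]
pre_action_only = cull(found, var_type, excluded=("post",))
```

Running the method once on the full database and culling afterwards in this manner yields the same kind of result set as running it on a database from which the unwanted variable type was omitted.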

If the database contains only variables of one type (i.e. only action variables or pre- or 
post-condition variables) then the correlations derived from it can be interpreted in one of two 
ways. If the variables are pre- or post-condition variables, then the results indicate situational 
archetypes - that is, sets of attribute values (or, equivalently, states of variables) that tend to 
be seen together. An example from the domain of weather patterns would be rain and low 
barometric pressure. If only action variables are present in the database then correlations 
found between them indicate sets of decisions that tend to be made together. In a military 
domain, we might discover that flanking maneuvers and offensives tended to be seen 
co-occurring. As these types of databases are very similar to others described elsewhere in this 
document (as would be the applications of the method in these cases), this section will not 
explicitly address them. 

The cases where the database contains variables of only two of the three types are three in 
number. 

Correlations found in a database that contains only pre-condition and action variables 
describe the relationship between situations in the domain and the selection of actions. An 
example is football play-calling (note that this also involves a complex system that cannot be 
modeled in any direct detailed way - the play-caller). Here the correlations indicate the 
tendencies of the action-taking entity, e.g., a coach or quarterback. 

If the database contains only action and post-condition variables, then the correlations found 
elucidate the effectiveness of sets of actions regardless of pre-conditions. Going back again 
to the football example, correlations of this type would illuminate the ability of the team in 
question to perform certain actions (e.g., if "third and long yardage to first down" tended to 
result in a poor post-condition set, like fourth down, then we would know that the team 
tended to be ineffective in this situation). Another important example is drug interaction. In 
this case, the actions are the drugs given and the post-conditions are the side-effects 
reported for some patient. 

While the utility of the case where the database contains only pre- and post-condition 
variables may be unclear on first examination, it may well be that this is one of the most 
useful cases. Here we are either interested in things that tend to happen after a situation in 
the given domain regardless of actions taken by the decision-maker or we are in a domain 
where there are no actions that can be taken (or none that affect the system itself). An 
example of the former would be the fact that the pre-condition "third and long" in football 
tends to be followed by the post-condition "fourth and long". In fact, it may be the latter 
case that is the most interesting. Consider the case of weather patterns. If we focus on the 
post-condition "tornadoes" (that is, we cull the resulting correlation set so that it includes 
only those correlations that involve the appearance of "tornadoes" in the post-condition), 
then what these correlations tell us are precursor signs that tornadoes are imminent. 

The last case is the most general: the database contains all three types of variables. Note 
that a database of this form is capable of having correlations of attributes of all the preceding 
types. Example domains have already been given (economies, crime in a population, etc.). 
Here the correlations can be thought of as rating action sets (given some set of 
pre-conditions) based on the quality of the post-conditions. 

The last consideration is the types of data that the database entries contain. Binary valued 
attributes, as noted throughout this document, can readily be accepted by this method. 
Other value types must have a limited range of discrete values. Where this is not the case (i.e. 
real-valued or integer-valued attributes), some transformation must be performed on the 
values in question to reduce their range of values to a more manageable number. Various 
clustering methods are among the preferred methods for this, and are well-known to those 
skilled in the art. 

In all cases, the correlations returned by the method are ideal inputs to a case-based 
reasoning package. Given a condition of the system (i.e. the current condition), a 
case-based reasoning tool could use the associations found by the principles described herein as a 
basis for analysis of possible outcomes of selections from the set of actions that can be 
applied to the system. 

Generally, the principles described herein can be used as a tool to aid decision-makers.
Decision-makers can be "real" or artificial (that is, the method can be used as part of an
artificial intelligence engine whose purpose is to make decisions in the domain of interest).

Description of the Application of the Principles Described Herein to
Databases with Pre-condition Variables and Action Variables:

Given the above-noted restrictions on the form of the database, it is clear that the input
requirements for the application of the embodiments described elsewhere herein are met. In
the convenient data matrix representation cited elsewhere in this document, the M rows in
this context are the total selected set of pre-conditions and actions taken. If the entity that
applies the actions can sensibly be personified then these rows can represent a history of the
decisions made by this entity and the states of the system at the time they were made. The N
columns comprise the set of state variables that define the state of the system and the set of
all applicable action variables that describe the ways in which the system can be perturbed
(see Table 14).

The rows of Table 14 correspond to instances of or combinations of system states (the
pre-condition of the system) followed by actions taken in response to that state, while the
columns correspond to variables thought to describe the state of the system and possible
actions that can be applied to the system. The value in table cell[i, p] is an encoding of the
measure of state variable p in event i if column p is a pre-condition column and is an
encoding of the action taken in event i if column p is an action column.

           Pre 1     ...   Pre j     Act 1       ...   Act k
Row 1      C(1,1)    ...   C(1,j)    A(1,j+1)    ...   A(1,j+k)
Row 2      C(2,1)    ...   C(2,j)    A(2,j+1)    ...   A(2,j+k)
...
Row M      C(M,1)    ...   C(M,j)    A(M,j+1)    ...   A(M,j+k)

Table 14



There are some other considerations that must be addressed prior to the application of the
principles described elsewhere herein to any given domain. The set of state variables must
be defined. This is left to those skilled in the domain itself (e.g., football coaches, military
analysts, etc.).

Previously noted examples are the cases of football play-calling by coaches and military
decisions made by generals. In general, preferred implementations of this invention will use
the method of the current invention on databases of this form in order to extract information
about the action-taking entity. The correlated state variables and actions describe the
tendencies of this entity. As noted above, these may be further analyzed using case-based
reasoning tools to give a better picture of the entity's likely decisions given a state of the
system.

Another use of the invention on databases of this type is in discovering fraud indicators in
tax collection. Here we let the pre-conditions be a set of attributes intended to capture the
salient details of a tax return (such things as total income, total tax owing as reported by the
individual or business, tax exemptions claimed, etc.) and choose the action variables to
define a set of possible tax evasion methods. The correlations found by the invention then
indicate associations between types of tax returns and types of tax evasion. As coincidence
detection bounds the returned correlations statistically, we find not only indicators of
evasion but also the reliability of these findings. Given that tax collection agencies cannot
afford to investigate all tax returns sent to them, this method allows them to find a
well-chosen subset of these returns that is most likely to result in findings of fraud (and greater
monetary returns for the government).
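
The statistical bound mentioned above can be illustrated with a short sketch that mirrors the
arithmetic of the `chernoff` subroutine listed in Appendix A, namely
exp(-2 (observed - expected)^2 / (T r^2)). The Python function name and the sample numbers are
hypothetical, and reading `r` as the sample size and `T` as the number of sampling iterations is
an assumption based on how the listing passes these values.

```python
import math


def chernoff_bound(observed, expected, r, T):
    """Evaluate exp(-2 * (observed - expected)**2 / (T * r**2)).

    A smaller value indicates that the observed coincidence count is less
    likely to be explained by chance alone.
    """
    diff = observed - expected
    return math.exp(-2.0 * diff * diff / (T * r * r))


# Hypothetical counts: 40 observed coincidences against 10 expected,
# with r = 5 and T = 100.
p = chernoff_bound(40, 10, 5, 100)
print(round(p, 4))

# When observed equals expected the bound is 1.0, i.e. no evidence of correlation.
print(chernoff_bound(10, 10, 5, 100))  # 1.0
```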

The last such use that will be presented is in the domain of insurance fraud and is very
similar to the application of the principles described herein to tax collection. The
pre-condition variables are intended to capture a set of details in an insurance claim that are
thought to be possible indicators of fraud (amount claimed, specifics concerning the insured
entity, etc.) and the action variables represent types of fraud. The results found when the
principles described herein are applied show correlations between the details of insurance
claims and types of fraud. Insurance companies cannot investigate all claims sent to them,
so the application of the principles described herein will narrow the total list of such claims
to a set more likely to be the subject of fruitful investigations.

Steps involved in applying the principles described herein to a database containing
pre-condition and action variables include:

1. Create the database of system states and actions taken by the action-taking entity as
described above. Where necessary, use methods known in the art to transform
continuous-valued attributes into discrete-state attributes.

2. Present this database, in whole or part, such that each state/action set corresponds to
one of the M objects (rows) in a data matrix and so that each state type aspect and
action type corresponds to an attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on the
database.
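
Steps 1 and 2 above can be sketched as follows for a small, made-up football history. This is
only an illustration of the row and column layout; the column names, the event records and the
helper `build_data_matrix` are hypothetical and are not part of the described embodiments.

```python
PRE_COLUMNS = ["down", "distance"]   # hypothetical pre-condition variables
ACTION_COLUMNS = ["play"]            # hypothetical action variable


def build_data_matrix(events):
    """Return (columns, rows): one row per event, pre-condition columns first."""
    columns = PRE_COLUMNS + ACTION_COLUMNS
    rows = [[event[col] for col in columns] for event in events]
    return columns, rows


# Each record is one observed system state plus the action taken in response.
events = [
    {"down": 3, "distance": "long", "play": "pass"},
    {"down": 3, "distance": "long", "play": "pass"},
    {"down": 1, "distance": "long", "play": "run"},
    {"down": 3, "distance": "short", "play": "run"},
]

columns, matrix = build_data_matrix(events)
print(columns)      # ['down', 'distance', 'play']
print(len(matrix))  # 4 rows (M = 4), each with N = 3 attribute values
```

The resulting matrix is what step 3 would hand to the base method.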

This application of the principles described herein provides and utilizes a list of correlated
state/action sets that give insight into the inclinations of the action-taking entity. Were one to
be interested solely in one system state (or in only a few aspects of a given state), for
example the current state, one could cull from the results any correlations that do not share a
given set of aspects with that state. The resultant set would represent correlations between
the aspects of interest and the actions taken in response. The resulting insight into the
action-taking entity's methodology can be used in further decision-making.
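
The culling step described above can be sketched as a simple filter. The representation of a
correlated k-tuple as a tuple of (column, value) pairs, and the function name `cull`, are
assumptions made for illustration only.

```python
def cull(correlations, required_aspects):
    """Keep only the k-tuples that contain every required (column, value) pair."""
    required = set(required_aspects)
    return [c for c in correlations if required <= set(c)]


# Hypothetical output of the method: correlated k-tuples of attribute values.
correlations = [
    (("down", 3), ("distance", "long"), ("play", "pass")),
    (("down", 1), ("play", "run")),
    (("distance", "long"), ("play", "pass")),
]

# Aspect of the current state that we care about.
current_aspects = [("distance", "long")]

kept = cull(correlations, current_aspects)
print(len(kept))  # 2
```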

Description of the Principles Described Herein as Applied to Databases with
Pre-condition Variables and Post-condition Variables:

Here, too, the above-noted restrictions on the form of the database force compliance with
the input requirements of the embodiments described elsewhere herein. The M rows in this
context are the instances or combinations of pre-conditions and post-conditions (viewed
together, one can think of these rows as being the system's transitions between states). The
N columns comprise the set of state variables that define the state of the system
before and after the transition (see Table 15).

The value in cell[i, j] of Table 15 is an encoding of the measure of state variable j either
before or after the transition.

...        ...       ...   ...       ...         ...   ...
Row M      ...       ...   ...       C(M,j+1)    ...   C(M,j+k)

Table 15

There are some other considerations that must be addressed prior to the application of this
invention in any given domain. The set of state variables must be defined. This is left to
those skilled in the domain itself.

Equally important is the selection of time quanta that define the granularity of the transitions.
This too is left to those skilled in the art to decide based on their own expertise and the kinds
of information they wish to extract. It is assumed that some minimum granularity is imposed
by either the complexity of gathering such data or by the limits of the usefulness of such
data. Given this, one can then pick any multiple of this minimum granularity to be the time
between pre- and post-conditions. At the very least, this distance in time should be long
enough for the system to have changed its state.
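
To make the role of the time quantum concrete, the following minimal sketch pairs each observed
state with the state `lag` observations later, so that every row holds a pre-condition followed
by a post-condition. The function name, the `lag` parameter and the toy weather series are
hypothetical.

```python
def transition_rows(series, lag):
    """Pair the state at time t with the state at time t + lag (lag >= 1)."""
    return [series[t] + series[t + lag] for t in range(len(series) - lag)]


# Hypothetical daily observations, already discretized: (pressure, precipitation).
daily = [
    ("low", "dry"),
    ("low", "rain"),
    ("high", "dry"),
    ("high", "dry"),
    ("low", "rain"),
]

one_day = transition_rows(daily, 1)  # time quantum = minimum granularity
two_day = transition_rows(daily, 2)  # time quantum = twice the minimum

print(one_day[0])    # ('low', 'dry', 'low', 'rain')
print(len(two_day))  # 3
```

Choosing a larger multiple of the minimum granularity yields fewer rows, each describing a
transition over a longer interval.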

Possible domains of application for this invention include economics and fiscal policy, stock
market prediction, athletic talent scouting and weather prediction. Presented below are brief
descriptions of each in turn to show how these problems may be organized to fit the
specifications of the method of the current invention.

In the domain of economics and fiscal policy, we propose a database of sets of states where
the states are a set of economic indicators (inflation and interest rates, housing starts, GDP
and so on). Each row in the database should contain two such states (the pre- and
post-condition of the system) separated by a fixed amount of time. The correlations found by
the method of the current invention then give insight into cycles in the economy.

For stock market prediction, we propose a set of stocks (presumably large) which are
thought to have influence over one another. Again, a fixed period of time is selected for
transitions. The rows of this database then describe the transitions of these stocks over the chosen
period of time. The output of the invention then indicates which sets of stocks "move" in a
correlated manner over that period of time.

Athletic talent scouting (e.g., by professional teams prior to a draft of young players) would
involve an examination of the history of such selections. Each row of the data matrix would
then pertain to an individual player. The pre-condition state is a selection of statistics (and
any other information available about the player) thought to be indicative of future
performance at the professional level. The post-condition state would then be some set of
variables intended to measure that player's success at the professional level. The
correlations discovered by the invention would help teams find the best set of indicators of
future success with which to make their selections. Note that in this case, the pre- and
post-conditions need not be of exactly the same form. There is no intended restriction on state
representations to force them to be equivalent.

Weather prediction is a very straightforward application of this invention. Here the
granularity of the selected time quantum is based solely on the kind of information the user
wishes to discover. Put another way, the time quantum determines the degree of prediction
desired. If we choose a single day, then the correlations found by the method will help us
predict the weather (given a set of values for each of the pre-condition variables that
describes the current weather) a day in advance. If a week (or a month, etc.) is the chosen
quantum, then this is how far into the future the predictions will extend.

In general, preferred embodiments of this invention will use the method of the current
invention on databases of this form in order to extract information about how the current
state of the system acts as a predictor for a future state. Given probabilistically bounded
data correlations between states of the system, effective predictions can be made about the
system's behavior.

Steps involved in applying the current invention to a database containing pre-condition and
post-condition variables include:

1. Create the database of transitions between system states, wherein a system state is
represented by a value of a state variable, over the chosen time quantum as described above.
Where necessary, use methods known in the art to transform any continuous-valued state
variables into discrete-state variables.

2. Present this database, in whole or part, such that each state-to-state transition set
corresponds to one of the M objects (rows) in the embodiment's data matrix and so that
each state variable corresponds to an attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on
the database.

Description of the Application of the Principles Described Herein to Databases 
with Action Variables and Post-condition Variables: 

Here, too, the above-noted restrictions on the form of the database force compliance with
the input requirements of the embodiments described elsewhere herein. The M rows in this
context are the total selected set of actions and post-conditions. The N columns
comprise the set of state variables that define the state of the system before and after the
transition (see Table 16).

The rows of Table 16 correspond to observed instances of, or hypothetical combinations of,
actions applied to the system and their resulting system states. The columns correspond
either to possible actions that can be applied to the system or to individual state representation
variables. If column p corresponds to one of the action types in the database, the value in
table cell[i, p] of Table 16 is an encoding of the action taken. If column j is a column used
to indicate some aspect of a state of the system, then the value in table cell[i, j] is an
encoding of the measure of that aspect.

Table 16

As noted in previous examples, decisions that must be made prior to the application of the
method of the current invention to databases of this type include the choice of state variables
used to store the state of the system at a given point in time and the choice of time quantum
used to temporally separate the actions from the post-conditions. These choices are left to
those skilled in the domain of application. The time quantum chosen must, in the most
trivial case, be long enough for the actions to have had some effect on the state of the
system.

Possible uses of this invention include such widely varying fields as player management in
hockey and the study of drug interaction.

For the purposes of this document, player management in hockey concerns only the selection
of players for the next shift on the ice given knowledge of the history of these players. The
action variables in this case are binary values indicating whether or not a player is selected
for the shift, while the post-condition variables comprise a set of outcomes within the domain
of hockey (such things as the relative score in that shift, penalties called, the length of any
penalties, relative number of shots taken, etc.). By the formulation of the problem, it is clear
that the discoveries produced by the invention indicate correlations between sets of players
chosen and outcomes on the next shift. In situations where the opposing players are known
a priori, these players can be added to the action variables. In this case, we will find
correlations between sets of players, both for our team and against it, and outcomes. Given
this knowledge the invention is useful as an aid to coaches in selecting players most likely to
produce beneficial results.

The study of drug interaction is a natural fit for this invention. Here we let the action
variables be binary values indicating whether or not a given patient has been administered
some drug or combination of drugs. The post-condition variables indicate the list of side
effects reported by the patient. The results found by the invention then indicate statistically
bounded correlations between sets of drugs given to patients and side effects. In this
fashion, the method of the current invention can be used to determine contra-indications in
the use of drugs but is perhaps best suited as a way to select sets of interactions upon which
to focus further study.
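
The comparison of observed and expected coincidence counts that underlies such findings can be
illustrated on binary drug and side-effect columns. The sketch below is deliberately simplified:
it counts over the whole data set rather than over iterated random subsets, and it assumes
independence of the attributes when computing the expected count. The attribute names and patient
rows are made up.

```python
def observed_and_expected(rows, attrs):
    """Return (observed, expected) counts of rows in which every attribute in attrs is 1.

    The expected count assumes the attributes are independent: M times the
    product of the individual attribute frequencies.
    """
    m = len(rows)
    observed = sum(1 for row in rows if all(row[a] == 1 for a in attrs))
    expected = float(m)
    for a in attrs:
        expected *= sum(row[a] for row in rows) / m
    return observed, expected


# Hypothetical patient records: two action variables and one post-condition variable.
rows = [
    {"drug_a": 1, "drug_b": 1, "nausea": 1},
    {"drug_a": 1, "drug_b": 1, "nausea": 1},
    {"drug_a": 1, "drug_b": 0, "nausea": 0},
    {"drug_a": 0, "drug_b": 1, "nausea": 0},
    {"drug_a": 0, "drug_b": 0, "nausea": 0},
    {"drug_a": 0, "drug_b": 0, "nausea": 0},
    {"drug_a": 0, "drug_b": 0, "nausea": 0},
    {"drug_a": 0, "drug_b": 0, "nausea": 0},
]

obs, exp = observed_and_expected(rows, ["drug_a", "drug_b", "nausea"])
print(obs, exp)  # 2 0.28125
```

An observed count well above the expected count marks the 3-tuple as a candidate for further
study; it does not by itself establish a contra-indication.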

Steps involved in applying the current invention to a database containing action and
post-condition variables include:

1. Create the database of transitions between system states and actions over the chosen
time quantum as described above, wherein a system state is represented by a value of a state
variable and an action is represented by a value of an action type. Where necessary, use
methods known in the art to transform continuous-valued state variables and action types
into discrete state variables and action types.

2. Present this database, in whole or part, to an embodiment of the current invention such
that each action set/state set pair corresponds to one of the M objects (rows) in the
embodiment's data matrix and so that each state variable or action type corresponds to an
attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on
the database.

Description of the Application of the Principles Described Herein to Databases with
Pre-condition Variables, Action Variables and Post-condition Variables:

Here, too, the above-noted restrictions on the form of the database force compliance with
the input requirements of the embodiments described elsewhere herein. The M rows in this
application are the total selected set of pre-conditions, actions and post-conditions. The N
columns comprise the set of state variables that define the state of the system before
and after the transition as well as the encoded action types (see Table 17).

The rows of Table 17 correspond to instances or combinations of pre-conditions, actions
taken and the resulting post-conditions. The columns correspond to types of actions
possible in the domain as well as aspects of interest to any given situation in the domain (for
both pre- and post-condition columns). If column p corresponds to one of the action types in
the database, the value in cell[i, p] of Table 17 is an encoding of the action taken. If column
p is a column used to specify some aspect of either the pre-condition or the post-condition,
then the value in table cell[i, p] is an encoding of the measure of that aspect.

Table 17

As noted in previous examples, decisions that must be made prior to the application of the
method of the current invention to databases of this type include the choice of state variables
used to store the state of the system at a given point in time and the choice of time quantum
used to temporally separate the actions from the post-conditions. In this case, it should be
noted that it is not necessary for the pre- and post-conditions to be equivalent (with respect to
the choices of variables). These choices are left to those skilled in the domain of application.
The time quantum chosen must, for example, be long enough for the actions to have had
some effect on the state of the system.

Possible uses of this invention include economic policy, crime-fighting and military
strategizing.

Given some set of variables to define the state of an economy (interest rates, inflation, GNP
and so on) and a set of actions taken as part of the governing body's economic policy
(issuing and buying back government bonds, etc.), we create a database of economic events
of the form: existing economic state, fiscal policy measures taken and economic state
following the policy decisions. The correlations found by the method of the current
invention give a measure of the effectiveness of economic policy decisions, given a state of
the economy. Such knowledge would be beneficial in deciding economic policy as it would
show historical support (or the lack thereof) for a given set of decisions.

In a similar vein, the use of the current invention to aid in setting anti-crime policy starts
with the creation of a database of previous states of the community's crime, policy measures
taken and the resulting state of crime in the community. The state variables could include
things like the rates for differing types of crime (breaking and entering, auto theft, etc.),
differing characteristics of crime (e.g. whether or not handguns were used) and so on.
The action variables in this case could include such things as minimum sentencing guidelines
for various crimes, "three-strike" laws, the adoption of the death penalty, as well as
education and mental health funding. On such a database, the invention would find
correlations involving existing crime states, policy decisions and the outcomes of those
decisions. It is proposed that these correlations could prove an invaluable aid to those
charged with making such decisions.

The concept of the "decision-maker" needs careful consideration in the domain of military
strategy. It may well be the case that there is not enough of a "track record" to fill a
database with enough of a history of any one general's decision making. In such a case,
preferred implementations can extend the concept of the decision-maker to include all similar
decision-makers. As an example, consider a single general commanding a tank division. If
the general were recently promoted, one would be wise to consider all the history of all such
generals of the same allegiance. To increase further the granularity of the use of the method,
the database could be filled with the decisions made by all infantry lieutenants rather than
with those of any one lieutenant. Correlations found would be indicative of the tendencies
of that class of generals given some measure of the battlefield conditions faced when they
made their decisions. Equally, one would be in a position to determine which battlefield
situations they handled poorly because one has access to the outcomes of the decision sets.
Such knowledge could prove vital to selecting an opposing strategy.

Steps involved in an application of the principles described herein to a database containing
pre-condition, action and post-condition variables include:

1. Create the database of states and actions covering the chosen time quantum as described
above. Where necessary, use methods known in the art to transform continuous-valued
state variables and action types into discrete state variables and action types.

2. Present this database, in whole or part, such that each state/action/state triple
corresponds to one of M objects (rows) in a data matrix and so that each state variable or
action type corresponds to an attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on
the database.

It will be understood by those skilled in the art that this description is made with
reference to the preferred embodiment and that it is possible to make other embodiments
employing the principles of the invention which fall within its spirit and scope as defined by
the claims on the pages following Appendices A through E attached hereto, which
Appendices form a part of this description.

APPENDIX A

# File: coinc.pl
#
# perl version of Evan Steeg's Coincidence Detection Algorithm,
# here applied to data which comes in rows and columns of ascii
# symbols. Used first for tests on artificial and real (HIV)
# protein sequence data.

# march 1996

######################################################################

$tiny_num = 0.000001; 

$fact[0] = 1;
$fact[1] = 1;
$fact[2] = 2;
$fact[3] = 6;
$fact[4] = 24;
$fact[5] = 120;
$fact[6] = 720;
$fact[7] = 5040;
$fact[8] = 40320;
$fact[9] = 362880;
$fact[10] = 3628800;
$fact[11] = 39916800;

sub compare
{
    if ($a < $b)
    {
        $r = -1;
    }
    elsif ($a == $b)
    {
        $r = 0;
    }
    else
    {
        $r = 1;
    }

#   print "a: $a, b: $b, r: $r\n";

    return $r;
}

sub comp_aa
{
#   my ($a1, $c1, $a2, $c2, $r);
    my ($c1, $c2);

#   $a1 = substr $a, 0, 1;
    $c1 = substr $a, 1;

#   $a2 = substr $b, 0, 1;
    $c2 = substr $b, 1;

    if ($c1 < $c2)
    {
        $r = -1;
    }
    elsif ($c1 == $c2)
    {
        $r = 0;
    }
    else
    {
        $r = 1;
    }

    return $r;
}

# calc the factorial of a number. want (n)
# for now, it's just easier and faster to hard code them into a table
sub factorial
{
    my ($n) = @_;

#   print "n: $n\n";

    if ($n >= 0 && $n <= 11)
    {
        return $fact[$n];
    }
    else
    {
        print "ERROR: n larger than max defined factorial requested. ($n)\n";
        exit(0);
    }
}



# calc the binomial coeff. want r (number of iterations) and h (
# observed number of hits)
sub binomial_coeff
{
    my ($r, $h) = @_;

#   print "r: $r, h: $h\n";

    $rf = &factorial($r);
    $hf = &factorial($h);
    $rhf = &factorial(($r - $h));

#   print "rf: $rf, hf: $hf, rhf: $rhf\n";

    return ($rf / ($hf * $rhf));
}

# calc the chernoff. want ($observed, $expected, $r1, $T1)
sub chernoff
{
    my ($observed, $expected, $r1, $T1) = @_;

    $diff = $observed - $expected;
    $diff_sq = $diff * $diff;
    $numerator = 2.0 * (0.0 - $diff_sq);
    $denominator = $T1 * ($r1 * $r1);
    return (exp($numerator / $denominator));
}

# calc the ith power of a number. NOTE: this thing can only grok
# positive integer exponents larger than 0!
sub pow
{
    my ($i, $p) = @_;

    if ($p < 0 || $p != int($p))
    {
        print "ERROR: I can only grok positive integer exponents larger than 0!\n";
        exit(0);
    }

    $a = 1.0;

    for ($n = 0; $n < $p; $n++)
    {
        $a *= $i;
    }

#   print "i: $i, p: $p, a: $a\n";
    return $a;
}



# want ($r, $h, $c_element), cset and aasites assumed as global
sub prob_coincidence
{
    my ($r, $h, $c_element) = @_;
    my @elements;

    if ($r > 0)
    {
        $joint = 1.0;
        $joint_neg = 1.0;

        @aalist = split /\|/, $c_element;
#       print "c_element: $c_element, aalist: @aalist\n";

        foreach $aa (@aalist)
        {
            $joint *= $aasites{$aa};
            $joint_neg *= (1.0 - $aasites{$aa});
#           print "aa: $aa, joint: $joint, joint_neg: $joint_neg\n";
        }

#       $ans = &binomial_coeff($r, $h) * &pow($joint, $h) *
#              &pow($joint_neg, ($r - $h));

        $ans = &binomial_coeff($r, $h) * ($joint ** $h) *
               ($joint_neg ** ($r - $h));
    }
    else
    {
        return (0.0);
    }

#   print "joint: $joint, joint_neg: $joint_neg, ans: $ans\n";
    return $ans;
}

sub expected_size
{
    my ($r, $c_element) = @_;

    $sum = 0.0;

    foreach $h (1..$r)
    {
        $sum += (&prob_coincidence($r, $h, $c_element) * $h);
#       print "r: $r, h: $h, sum: $sum\n";
    }

    return $sum;
}



sub prob_of_correlation
{
    my ($c_element, $h_total_obs, $h_expected_total, $r, $T) = @_;

#   $h_expected_total = &expected_size($r, $c_element);

    $ch = &chernoff($h_total_obs, ($h_expected_total * $T), $r, $T);

    return $ch;
}

# randomly select a list of 'sample_size' unique sequences
# in the range from 0 to the number of rows in @family
# want (sample_size, family)
sub rsample_family
{
    my $R = shift @_;
    my @family = @_;

    my (%which_rows, @sampled_family, @sampled_rows);

#   print "whichrows: ", keys %which_rows, "\n";

    # generate $R number of unique keys
    $f = scalar @family;

    while (scalar(keys %which_rows) < $R)
    {
        $n = int(rand $f);
#       print "randnum: $n\n";

        $which_rows{$n} = 1;
    }

#   print "whichrows: ", keys %which_rows, "\n";

    # pick out the corresponding sequence from the 'family list'
    @sampled_rows = keys %which_rows;
    foreach $line (@sampled_rows)
    {
        push @sampled_family, $family[$line];
    }

#   print "RSAMPLE\n";
#   $i = 0;
#   foreach $line (@sampled_family)
#   {
#       print $line, ": ";
#       $n = $sampled_rows[$i];
#       print $n, $family[$n], "\n";
#       $i++;
#       print "$line\n";
#   }
#   print "RSAMPLE END\n";
#   exit(0);

    return @sampled_family;
}



# return the n'th column of an array
# want ($n, @array)
sub column
{
    my $n = shift @_;
    my @a = @_;
    my $col;

#   print "COLUMN: $n\n";
#   foreach (@a)
#   {
#       print "$_\n";
#   }

    # go thru and append the n'th element of each row in @array to $col
    $col = "";
    foreach $line (@a)
    {
        $col = $col . substr $line, $n, 1;
    }

#   print length $col, $col, "\n";
#   print "COLUMN END\n";

    return $col;
}

# find all occurrences of a character 'aa' in the n'th column of the
# array sampled_family
# want ($aa, $n, @sampled_family)
sub find_all
{
    my $aa = shift @_;
    my $n = shift @_;
    my @sampled_family = @_;
    my ($bstring, $col);

#   print "FIND_ALL: $aa, $n\n";
#   print "012345678901234567890\n";
#   foreach (@sampled_family)
#   {
#       print "$_\n";
#   }
#   print "JUMPING TO COL\n";

    $col = &column($n, @sampled_family);
#   print "GOT: $col\n";

    $bstring = "";

    if ((index $col, $aa) != -1)    # make sure $aa is found in $col
    {
        for ($i = 0; $i < length($col); $i++)
        {
            $c = substr $col, $i, 1;
            if ($c eq $aa)
            {
                $bstring = $bstring . "1";
            }
            else
            {
                $bstring = $bstring . "0";
            }
        }
    }
    else
    {
        $bstring = "NOT_FOUND";
    }

#   print "$bstring\n";
#   print "FIND_ALL END\n";
#   exit(0);

    return $bstring;
}

# this subroutine isn't exactly the most optimal code, but...
sub mi
{
    my ($col1, $col2, $m) = @_;
    my ($s1, $s2, $row, $p1, $p2, $pj, $a1, $a2, $s, %sj, $contrib, $total);

    $s1 = column($col1, @family);
    $s2 = column($col2, @family);

#   print "col1: $col1, $s1\n";
#   print "col2: $col2, $s2\n";
#   print "keys1: ", keys %sj, "\n";

    # calc the joint prob
    for $row (0..($m-1))
    {
        $a1 = substr $s1, $row, 1;
        $a2 = substr $s2, $row, 1;

        $s = $a1 . $a2;

        if (exists $sj{$s})
        {
            $sj{$s}++;
        }
        else
        {
            $sj{$s} = 1;
        }

#       print "a1: $a1, a2: $a2, s: $s\n";
    }

#   print "keys2: ", keys %sj, "\n";

    foreach $s (keys %sj)
    {
        $sj{$s} = $sj{$s} / $m;

        if ($sj{$s} < $tiny_num)
        {
            $sj{$s} = $tiny_num;
        }

#       print "$s: $sj{$s}\n";
    }

    $total = 0;

    foreach $s (keys %sj)
    {
        $a1 = substr $s, 0, 1;
        $a2 = substr $s, 1, 1;

        # find partial probs
        $a1 = $a1 . $col1;
        $a2 = $a2 . $col2;

        $pj = $sj{$s};
        $p1 = $aasites{$a1};
        $p2 = $aasites{$a2};

        if ($p1 < $tiny_num)
        {
            $p1 = $tiny_num;
        }

        if ($p2 < $tiny_num)
        {
            $p2 = $tiny_num;
        }

        if ($pj < $tiny_num)
        {
            $pj = $tiny_num;
        }

        $contrib = ($pj * log($pj / ($p1 * $p2)));
        $total += $contrib;

#       print "a1: $a1, a2: $a2, s: $s, pj: $pj, p1: $p1, p2: $p2, contrib: $contrib, total: $total\n";
    }

    return $total;
}



sub incidence_vec
{
    my ($col, $key) = @_;
    my ($vec);
    $vec = "";

    if ((index $col, $key) != -1)
    {
        for $i (0 .. ((length $col) - 1))
        {
            $c = substr $col, $i, 1;
            if ($c eq $key)
            {
                $vec = $vec . "1";
            }
            else
            {
                $vec = $vec . "0";
            }
        }
    }
    else
    {
        $vec = "NOT_FOUND";
    }

    return $vec;
}



# given two columns, go through each letter in the alphabet and
# generate the incidence vector for them.  then if the results are
# non-zero, send them to mi2_real for the real computations
sub mi2
{
    my ($col1, $col2, $m) = @_;
    my ($s1, $s2, $key1, $key2, $total, $sum);

    $s1 = column ($col1, @family);
    $s2 = column ($col2, @family);

    $sum = 0.0;
    foreach $key1 (keys %alphabet)
    {
        $vec1 = incidence_vec ($s1, $key1);
        if ($vec1 ne "NOT_FOUND")
        {
            foreach $key2 (keys %alphabet)
            {
                $vec2 = incidence_vec ($s2, $key2);
                if ($vec2 ne "NOT_FOUND")
                {
                    $total = mi2_real ($vec1, $vec2, $m);

                    # print "s1: $s1\n";
                    # print "vec1: $vec1\n";
                    # print "s2: $s2\n";
                    # print "vec2: $vec2\n";

                    if ($total > 1.0)
                    {
                        # NOTE: the argument list of this printf is cut off in
                        # the source listing; the three arguments are assumed.
                        printf "mi2, cols: %d, %d | key1: $key1 | key2: $key2 | total: %.3f\n",
                            $col1, $col2, $total;
                    }

                    $sum += $total;
                }
            }
        }
    }

    print "total sum: $sum\n";
}



# Given two columns (the actual string of amino acid symbols),
# produce all combinations (pairs) of attr1, attr2, where attr1 is
# an incidence vector for a symbol occurring in col1 and
# likewise for attr2 from col2.  Then call mi2 on the pair
# of incidence vectors.

# Compute mutual_info (attr1, attr2) where attri are binary incidence
# vectors for two a1@col1, a2@col2.
sub mi2_real {
    my ($attr1, $attr2, $m) = @_;
    my ($a, $a1, $a2, $s, $p0, $p1, $pj, %hash_single1, %hash_single2,
        $total, %hash_joint);

    for $row (0 .. ($m-1))
    {
        $a1 = substr $attr1, $row, 1;
        $a2 = substr $attr2, $row, 1;
        $s = $a1 . $a2;

        #print "row: $row, a1: $a1, a2: $a2, s: $s\n";

        if (exists $hash_single1{$a1})
        {
            $hash_single1{$a1}++;
        }
        else
        {
            $hash_single1{$a1} = 1;
        }

        if (exists $hash_single2{$a2})
        {
            $hash_single2{$a2}++;
        }
        else
        {
            $hash_single2{$a2} = 1;
        }

        if (exists $hash_joint{$s})
        {
            $hash_joint{$s}++;
        }
        else
        {
            $hash_joint{$s} = 1;
        }
    }

    foreach $s (keys %hash_joint)
    {
        $hash_joint{$s} = $hash_joint{$s} / $m;
        if ($hash_joint{$s} < $tiny_num)
        {
            $hash_joint{$s} = $tiny_num;
        }
        #print "s: $s, hj: $hash_joint{$s}\n";
    }

    foreach $a (keys %hash_single1)
    {
        $hash_single1{$a} = $hash_single1{$a} / $m;
        if ($hash_single1{$a} < $tiny_num)
        {
            $hash_single1{$a} = $tiny_num;
        }
        #print "a: $a, hs1: $hash_single1{$a}\n";
    }

    foreach $a (keys %hash_single2)
    {
        $hash_single2{$a} = $hash_single2{$a} / $m;
        if ($hash_single2{$a} < $tiny_num)
        {
            $hash_single2{$a} = $tiny_num;
        }
        #print "a: $a, hs2: $hash_single2{$a}\n";
    }

    foreach $s (keys %hash_joint)
    {
        $a1 = substr $s, 0, 1;
        $a2 = substr $s, 1, 1;
        $pj = $hash_joint{$s};
        $p1 = $hash_single1{$a1};
        $p2 = $hash_single2{$a2};

        if ($p1 < $tiny_num)
        {
            $p1 = $tiny_num;
        }
        if ($p2 < $tiny_num)
        {
            $p2 = $tiny_num;
        }
        if ($pj < $tiny_num)
        {
            $pj = $tiny_num;
        }

        $total += ($pj * log ($pj / ($p1 * $p2)));
    }

    return $total;
}



########################################################
# check to make sure a file name was given
if (scalar @ARGV != 4)
{
    print "usage: $0 data_file sample_size iterations min_freq\n";
    exit;
}

$filename = $ARGV[0];
$sample_size = $ARGV[1];
$iterations = $ARGV[2];
$min_freq = $ARGV[3];

# read contents of file into array family
open (DATAFILE, $filename);
@family = <DATAFILE>;
chop @family;

# remove nial's +, and | delimiters
#@family = grep (!/\+/, @family);    # get rid of lines beginning with '+'
#foreach (@family)    # remove all '|'s
#{
#    tr/\|//d;
#}
#@family = grep (/^\w/, @family);
#foreach (@family)
#{
#    print "$_\n";
#}
#while (length $family[(scalar @family) - 1] < 1)
#{
#    print "Empty line: ", scalar @family, " deleted.\n";
#    pop @family;
#}
#$i = 0;
#foreach (@family)
#{
#    print "$i: $_\n";
#    $i++;
#}



########################################################
# Now for the real stuff!

print "Sample_size: $sample_size\n";
print "Iterations: $iterations\n";
print "Min_freq: $min_freq\n";

# construct aasite list
$n = length $family[0];
$m = scalar @family;
foreach $row (@family)
{
    for $j (0 .. ($n - 1))
    {
        $c = substr $row, $j, 1;
        if (length ($c) != 1)
        {
            print "BUG!!! $row, $j\n";
            exit;
        }
        #print "$line:$j:$c\n";
        $i = $j;    # + 1;
        $s = $c . $i;    # create aasite name
        # print "c: $c, j: $j, i: $i, s: $s\n";

        if (exists $aasites{$s})
        {
            $aasites{$s}++;
        }
        else
        {
            $aasites{$s} = 1;
        }
    }
}

# figure out the alphabet
#@a = keys %aasites;
#print @a, "\n";
# foreach (@a)
#{
#    print "$_:$aasites{$_}\n";
#}

foreach $entry (keys %aasites)
{
    $c = substr $entry, 0, 1;    # want the first character in each entry
    # print $c, "\n";
    $alphabet{$c} = 1;
}

print keys %alphabet, "\n";

# calc marginal probabilities for each column of aasites
foreach $key (keys %aasites)
{
    $p = $aasites{$key} / $m;
    $aasites{$key} = $p;
    # print "$key: $p\n";
}

for $col1 (0 .. ($n-2))
{
    for $col2 (($col1 + 1) .. ($n-1))
    {
        $mi = &mi ($col1, $col2, $m);
        print "columns: ", ($col1 + 1), " ", ($col2 + 1), " mi = $mi\n";
        $mi2 = mi2 ($col1, $col2, $m);    # might as well do mi2 while we're here
    }
}

#exit;

### MAIN LOOP

# seed the random number generator
#$seed = 111;
#srand ($seed);    # remove '($seed)' to get seed from the system clock
srand ();

# print "START MAIN LOOP\n";

for ($iter = 0; $iter < $iterations; $iter++)
{
    my %BINS;

    # print "\n ITERATION: $iter\n";
    print STDERR "ITERATION: $iter\n";

    # print "JUMP TO rsample_family\n";
    @sampled_family = &rsample_family ($sample_size, @family);

    # print "sample size: $sample_size\n";
    # print " 012345678901234567890\n";
    # $i = 0;
    # foreach (@sampled_family)
    # {
    #     print "$i: $_\n";
    #     $i++;
    # }
    # print "rsample printed\n";

    foreach $aasite (keys %aasites)
    {
        $aa = substr $aasite, 0, 1;
        $col_num = substr $aasite, 1;
        # print "aa: $aa, colnum: $col_num\n";

        $occurence_string = &find_all ($aa, $col_num, @sampled_family);
        # print $occurence_string, "\n";

        if ($occurence_string ne "NOT_FOUND")
        {
            # print "FOUND occ_str: $occurence_string\n";
            if (exists $BINS{$occurence_string})
            {
                $BINS{$occurence_string} = $BINS{$occurence_string} .
                    $aasite . "|";
            }
            else
            {
                $BINS{$occurence_string} = $aasite . "|";
            }
        }
    }

    # foreach (keys %BINS)
    # {
    #     print "$BINS{$_}\n";
    # }

    # sort the collision list associated with each BIN and throw away
    # entries with just one 'collision'
    foreach $bin (keys %BINS)
    {
        my @aalist;

        $s = $BINS{$bin};
        # print $s, "\n";
        @aalist = split /\|/, $s;
        # $i = 0;
        # foreach (@aalist)
        # {
        #     print "$i:$_\n";
        #     $i++;
        # }

        if ((scalar @aalist) > 1)    # throw away single 'collisions'
        {
            # then sort the others
            # $sorted_aalist = join "|", sort comp_aa @aalist;
            $sorted_aalist = join "|", sort @aalist;
            # print "sorted aalist: $sorted_aalist\n";

            $BINS{$bin} = $sorted_aalist;
        }
        else
        {
            # print "chucked\n";
            delete $BINS{$bin};
        }
    }

    # print "SORTED BINS\n";
    # $z = 0;
    # foreach (keys %BINS)
    # {
    #     print "$z:$_:$BINS{$_}\n";
    #     $z++;
    # }

    # now we update the cset table
    foreach $bin (keys %BINS)
    {
        $count = 0;

        # sum up bin hits; sample_size should equal length of bins
        for ($i = 0; $i < $sample_size; $i++)
        {
            $c = substr $bin, $i, 1;
            if ($c eq "1")
            {
                $count++;
            }
        }

        $key = $BINS{$bin};
        # print "cset key: $key\n";

        if (exists $cset{$key})
        {
            $cset{$key} += $count;
        }
        else
        {
            $cset{$key} = $count;
        }
    }

    # print "CSET\n";
    # $z = 0;
    # foreach (keys %cset)
    # {
    #     print "$z:$_:$cset{$_}\n";
    #     $z++;
    # }

    # print "$iter, BINS: ", scalar keys %BINS, "\n";
    # print "CSETS: ", scalar keys %cset, "\n";
    print STDERR "BINS: ", scalar keys %BINS, "\n";
    print STDERR "CSETS: ", scalar keys %cset, "\n";
}



print "CSETS: ", scalar keys %cset, "\n";
print "\n\nGathering stats.\n";

foreach $entry (keys %cset)
{
    $h_total_obs = $cset{$entry};
    $h_expected_total = &expected_size ($sample_size, $entry);
    $correlation = &prob_of_correlation ($entry, $h_total_obs,
                                         $h_expected_total,
                                         $sample_size,
                                         $iterations);

    if ($correlation < 0.000000001)
    {
        $correlation = 0.0;
    }

    if ($h_total_obs >= $min_freq)
    {
        # this is a really ugly hack to prevent hash key collisions
        $h = $h_total_obs;
        while (exists $output{$h})
        {
            $h = $h . "*";
        }

        # print "\nEntry: $entry\n";
        # print "Obsrv hits: $h_total_obs\n";
        # printf "Expct hits: %.9f\n", $h_expected_total * $iterations;
        # printf "Prob corrl: %.9f\n", $correlation;

        $output{$h}[0] = $entry;
        $output{$h}[1] = $h_total_obs;
        $output{$h}[2] = $h_expected_total * $iterations;
        $output{$h}[3] = $correlation;
    }
}

@hits = keys %output;
@hits = sort compare @hits;
#@hits = sort @hits;

# foreach (@probs)
#{
#    print "$_\n";
#}

print "SORTED\n";
foreach $hit (@hits)
{
    my (@aalist);

    # $i = index $hit, "*";
    # if ($i != -1)
    # {
    #     $h = substr $hit, 0, (index $hit, "*");
    # }

    $s = $output{$hit}[0];
    @aalist = split /\|/, $s;
    foreach (@aalist)
    {
        $aa = substr $_, 0, 1;
        $col_num = substr $_, 1;
        $_ = $aa . ($col_num + 1);
    }

    $s = join "|", sort comp_aa @aalist;

    # print "\nEntry: ", $output{$hit}[0], "\n";

    $observed = $output{$hit}[1];
    $expected = $output{$hit}[2];
    $prob = $output{$hit}[3];

    if ($expected < $observed && $prob < 0.5)
    {
        print "\nEntry: $s", "\n";
        print "Obsrv hits: ", $output{$hit}[1], "\n";
        printf "Expct hits: %.9f\n", $output{$hit}[2];
        printf "Prob corrl: %.9f\n", $output{$hit}[3];
    }
}
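
The mutual-information computation carried out by `mi` and `mi2_real` in the listing above can be illustrated with a short standalone sketch. It is not part of coinc.pl: the name `mutual_information` is hypothetical, the two columns are passed directly as strings rather than extracted from `@family`, and the `$tiny_num` clamping of the listing is omitted because only symbol pairs that actually occur are summed, so no probability is zero. The result is in nats (natural logarithm), as in the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of coinc.pl): mutual information, in nats,
# between two equal-length strings holding one symbol per sampled row.
sub mutual_information {
    my ($s1, $s2) = @_;
    my $m = length $s1;
    die "columns differ in length\n" unless $m == length $s2;
    return 0 if $m == 0;

    # Count single-symbol and joint (pair) occurrences row by row.
    my (%joint, %single1, %single2);
    for my $row (0 .. $m - 1) {
        my $a1 = substr($s1, $row, 1);
        my $a2 = substr($s2, $row, 1);
        $single1{$a1}++;
        $single2{$a2}++;
        $joint{$a1 . $a2}++;
    }

    # Sum p(a1,a2) * log( p(a1,a2) / (p(a1) * p(a2)) ) over observed pairs.
    my $total = 0;
    for my $pair (keys %joint) {
        my $pj = $joint{$pair} / $m;
        my $p1 = $single1{ substr($pair, 0, 1) } / $m;
        my $p2 = $single2{ substr($pair, 1, 1) } / $m;
        $total += $pj * log($pj / ($p1 * $p2));
    }
    return $total;
}

printf "identical columns:   %.4f\n", mutual_information("AABB", "AABB");
printf "independent columns: %.4f\n", mutual_information("AABB", "ABAB");
```

Two identical columns with two equally frequent symbols give log 2 (about 0.6931), while columns whose symbols pair up uniformly give 0; values between these extremes indicate partial covariation of the two positions.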
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APPENDIX B 



HIV input: 1/10

TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAH
TRPNNYTRKMIPTGPGQVIYATGKIIGDIRKAY
SRPNNNTRKSVHMGPGRAFYATGDIIGDIRQAY
IRPGNNTRKSMHIGPGRPFYARG-VIGDIRQAH
IRPNNNTRKSIHIGPGQAFYATGDIIGNIRQAH
IRPNNNTRTSVHMGPGKTFYATGDIIGDIRQAH
TRPNNNTRRSMRIGPGQTFYATGDIIGDIRQAY
TRPNNNTRKSIRIGPGQAFYATGDIIGDIRQAH
TRPSNNKRTSIHIAPGRAFYATGAIIGDIRQVH
IRPNNNTRRSVRIGPGQAFYATGDIIGDIRQAH
TRHNNNTRKSIRIGPGQAFYATGDIIGDIRQAH
TRPSNNTRKSIRIGPGQAFYATGDIIGDIRQAH
TRPNNNTRRSIHIGSGRAFY 1 IGDIRQAH
IRPSRTTRKRWHIGSGQAFYAIDGITGDIRKAY
TRPNNNTRRRMHIGPGRAFIATDAIVGDIRQAY
TRPSNNTRKSVPIGPGQAFYATDDIIGDIRQAH
TRPSNNTSKSIRIGPGQTFYATGRIIGDIRQAH
IRPSNNTRKSVNIGPGQAFYATGDIIGDIRQAH
TRPGNNTRKSVRIGPGQAFYATGDIIGDIRQAH
TRPGNNTRKSWHIGPGRAFYTTDGIIGDIRKAY
IRPGNNTRKGVHIGPGQAFYARGDIIGDIRQAH
TRPGNNTRKSLRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAH
IRPNNNTRKSVHIGPGQAFYATGDIIGDIRQAY
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
TRPGNYTRKSVRTGPGQTFYATGKIIGDIRQAH
TRPNNNTRKGIHIGPGSAIYATGDIIGDIRQAH
TRPNNNTRTGIHIGPGQTFYATGEIIGNIRQAH
TRPNNNTRRSVRIGPGQTFYATGAIIGDIRQAH
IRPNNNTRKSVRIGPGQTFYAAGDIIGDIRQAH
TRPGNNTRRSVRIGPGQAFYATGEIIGDIRKAH
TRLSNNTRKSVRIGPGQTFYATGEIIGDIRRAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRTSVRIGPGQAFYATGDIIGDTRQAH
TRPGNNTRRSVRIGPGQAIYATGDIIGDIRKAH
SRPNNNTRRSIHFGPGQTLYATGNIIGDIRQAH
TRPNNNTRRSIRIGSGQTSYATGDIIGNIREAH
SRPGNNTRKSVRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRKAH
TRPSNNTRKGIHIGPGRAFYATGQITGDIRQAH
TRPGNNTNKNVHIGPGQAFYARGRIIGDIRKAH
TRPNNNTRMSIRIGPGQAFYATGDIIGNIRQAH
TRPNNNTRKSIHIGPGQAFYATGDIIGNIRQAH
TRPNNNTRTGIHIGPGQAFYARGAITGDIRKAY
TRPXNNTRKSIHIGPGQAFYATGDIIGDIRKAH
TRPNNNTRTSIRIGFGQTFYATGDIIGNIRQAH
TRPGNNTRTSIRIGPGQAFYGRGNIIGDIRKAH
TRPNNNTRRSIRIGPGQAFYATGDITGDIRQAH
ARPNNNTRRSIHIGPGQAFYA-SDIIGDIRQAH
TRPNNNTRKSVH IG PGQArY ATGD I IGDIRQAH
TRPNNNTRKSIRIGPGQAFYTTGDIIGDIRQAH
IRPNNNTRTSIRIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSVPIGPGQAFYATDNIIGDIRQAH
TRPNNNTRTSICIGPGQTFYA-GGIIGDIRQAH
TRPNNNTRKSVHIGPGQAFYATGDIIGNIRQAH
TRPNNNTRXSIHIGPGQAFYATGDIIGDIRQAH
TRPSNNTRTSIRIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSANIGPGQAFYATGEIIGDIRQAH
IRPNNNTLKGIHIGPGQSFYATGSIVGNIRQAH
IRPYNNTRKSIHIGPGQAFYA-SRIIGNIRQAH
TRPNNNTRKSIRIGPGQTFYA-GEIIGNIRQAH
TRPNNNTRKGVHIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAY
TRPNNNTRTSIRIGPGQSFHATGDIIGDIRQAH
SRPNNNTRKSVHIGPGQAFYATGDVIGDIRQAY






WO 98/43182 



PCT/CA98/00273 



IRPNNNTRKSVPIGPGRAFYATGDIIGNIRQAH
TRPNNNTRKGVRIGPGQAFYATGGIIGDIRQAH
TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAH
TRPNNNTRTSVRIGPGQTFYATGDIIGDIRRAY
VRPNNNTRTSVRIGPGQTFYATGEIIGDIRRAF
TRPNNNTRRSIRIGPGQAFYATGDIIGDIRKAH
IRPNNNTRKSVHIGPGQAFYATGDIIGDIRQAH
IRPNNNTRKSVHIGPGQTSYATGDIIGDIRQAH
TRPNNNTRKSVHIGPGQAFYATGDIIGDIRQAH
TRPNNNTRRSVHIGPGQAFYATGDIIGDIRRAH
TRPNNNTRKS I HIjGPGRAF YATGDI IGDIRQAH
SRPYN-TRKNYSIGSGQAFYVTGKIIGDIRQAH
TRPYKKVRRRIHIGPGRSFY— T-SNIX3DIRQAY
TRPNNNISRRIHIGRGQAFYATGGMTGNIRQAY
IRPNNNTRKSVRIGPGQAFYATGDIIGNIRQAH
TRPNNNTRRSVRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRTSVHIGPGQAFYARGDIIGDIRQAH
TRPNNNTRKSIHIGPGQAFYARGDIIGNIRQAH
TRPNNNTRKSVHIGPGQAFYATGEIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGNIRQAH
TRPNNNTRKGVHIGPGQAFYATGDIIGNIRRAH
TRPNNNTRQSVHIGPGKAFYATGGIVGDIRQAY
TRPNNNTRKiVHIGPGQAFYATGAIIGSIRQAH
TRPNNNTRRSVHIGPGQAFYATGDIIGDIRQAH
TRPGNNTRRSVRIGPGQTFYATGDIIGDIRQAH
IRPNNNTRTSVRIGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIGIGPGQTFYAADNIIGDIRQAH
TRPGNNTRTSVRIGPGQAFYATGDIIGDIRQAH
TRPNNNTRTSVRIGPGQSFYATGDIIGDIKQAH
MRPNNNTRKSISIGPGRAFFATGDIIGDIRQAH
TRPSNNRRQSVRIGPGQAFYATGDIIGDIRRAH
TRPNNNTSQGVHIGPGQVFYARDRIIGDIRKAY
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAY
IRPNNNTRRG I HMG PGQ I L YATG5 1 IGDIRQAH
TRPNNNTRKSIRIGPGQVFYTN-DIIGDIRQAH
TRPNNNTRKSVHIGPGQAFYATGDIIGNIRQAH
TRPNNNTRKSIRIGPGQAFYATGDIIGNIRQAH
TRPNNNTRKS IRIGPGQVFYATG*
TRPtTNNTRKSVRIG PGQT FY ATGDI IGDIRQAH
TRPNNNTRTSVRIGPGQAFYATGDIIGDIRRAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNSKRKTLHMGPKRAFYATGDIGGYIRQAH
TRPNNNTRKSIQIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKGIHMGPGSTFYATGEIIGDIRQAH
TRPSNNTRKG I HIiG FGRALY ATGEITGD I RQAH
TRPNNNTRKS LSIiGPGRAFYTTGDIVGDIRQAH
TRPSNNTRKGIHIGPGRTFFATGEIIGDIRQAH
TRPNNNTSKGIHKGPGGAFYTTGRIIGDIRRAY
TRPNNNTRKSISIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKGIHMGWGRTFYATGEIIGAIRQPH
TRPNNNTRKSIHMGWGRAFYATGDIIGDIRQAH
TRPNNNTRKS I HVGWGRSL7TTGEI IGNI RLAH
TRPNNNTRKSIHMGWGRAFYATGEIIGDIREAH
TRPNNNTRKRIYIGPGRAVYTTGQIIGDIRRAH
ERPNNNTRKSINIGPGRAFYTTGDIIGDIRQAH
TRPSNNTRKSIHLGLGRAFYTTGDIIGDIRQAH
TRPHNNTRRSITIGPGRAFYTTGDIIGDIRQAH
TRPSNNTRKSIHliGWGRAFYATGEIIGDIRQAH
TRI-NNNTRTS I H IGPGQAFYATGDI IGDI RQAH
TRPNNNTRKSIHIGPGSAFYATGDIIGDIRQAH
TRPNNNTRKSIHMGWGRTFYATGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYAT-EITGDIRQAH
LRPSNNTRKSIHMGWGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHMGWGRAFYATGEIIGNIRQAH
TRPGNNTRKGIPIGPGGSFYATERIIGDIRQAH
IRPNNNTRRSIIIGPGRAFYATGDIIGDIRQAY
TOPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTXKSIHIGPGSAFYATGDIIGDIRQAH
TRPGNNTRRSIHMGWGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHMGWGRAFYATGEIIGNIRQAH
TRPNNNTRKSIHIGPGKAFYATGEIIGNIRQAY
TRPNNNTRKSIHLGWGRAFYATGEIVGDIRQAH
TRPNNNTRKSITIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHMGWGRTFYATGEIIGDIRQAH
TRPSNNTRKGIHIGPGRAFYATGDIIGDIRQAH
TRPSNNTRKSIHIGWGRAIYATGAXIGDIRQAH
TRPNNNTRKSIHVGWGRALYTTGEIIGNIRQAH
TRPNNNTRKSIQYGTGGAFYATGEIVGDIRQAH
TRPGNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNQTRKSIHMGWGRAFHTNGEIIGNIROAH
TRPNNNTRKGIHMGLGRAFYATGGIVGDIRQAH
TRPSNNTRKGIHIGWGRAFYATGEITGDIRKAY
SRPNNNTRKSIHMGWGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPGNNTRKSIHLGWGRAFYATGAIIGDIRQAH
TRPSNNTRKSIHLGWGRAFYATGEIVGDIREAH
TRPSNNTRRSIHLGPGGAFYTTGEIIGNIRKAF
TRPNNNTRKSIRIGPGSAFYATGDIIGDIRQAH
TRPNNNTRKSIPIAPGSAWFATGEIIGDIRQAH
TRPNNNTRKSIHLGWGRAFYTTGQIIGEIRQAH
TRPNNNTRKSIHVGVGRAIYATGEIIGDIRQAH
TRPSNNTRKSIHMGWGRAFYATGEIIGDIRRAH
TRPNNNTRKSIHMGWGRAFYTTGDIIGDIRQAH
TRPNNNTRKRKSIGPGRAFYTTGEVIGDIRQAH
TRPNNNTRKSIHMGPGSAIYATGEIIGDIRKAY
TRPNNNTRKGIHIGPGRAFYTT— DIIGDIRQAH
TRPNNYTSKRIRIGARRAFYTKGKIIGDIRQAH
TRPNNNTRKGIHIGPGRAVYTTGRIVGDIRLAH
TRPNNNTRKSIQRGPGRAFVTIGKI-GNMRQAH
TRPNNNTRNRISIGPGRAFHTTKQIIGDIRQAH
TRPNNNTRKSITKGPGRVIYATGQIIGDIRKAH
TRPYNNVRRSLSIGPGRAF*RTREIIGIIRQAH
TRPNNNTRKSINIGPGRAWYAT-NIIGDIRQAH
IRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYT-GEIIGDIRQAH
TRPNNNTSKRISIGPGRAFRAT-KIIGNIRQAH
TRPNNSTRKRISIGPGRVWYTTGQIIGDIRKAH
TRPNNNTRKRISIGPGRVWYTTGQIIGNIRKAH
TRPNNNTRRSGHIGGGRTLFTT-HIVGDIRKAH
TRPNNNTRKSIHIGPGRAFYT-GEIIGDIRQAH
TRPNNNTSKRISIGPGRAFRAT-KIIGNIRQAH
TRPNNNTRKRISIGPGRASYTTGQIIGDIRKAH
TRPNNNTRKRISIGPGRAWYTTGQIIGDIRKAH
TRPNNNTRRSGHIGGGRTLFTT-HIVGDIRKAH
TRPSNNTRKSIPMGPGKAFYTTGDIIGDIRQAY
TRPNNNTRKSIHIGPGRTFFTTGDIIGDIRQAH
TRPNNNTRKSINIGPGRAFYATGEIIGNIREAH
ERPNNNTKRSITIGPGRAFDAYGGIIGDIRQAH
TRPNNNTRKSIHMGPGKAFYTTGEIVGDIRQAH
TRPNNNTRKGIHIGPGGAFYATGGIIGDIRQAH
TRLNNNTRKSINIGPGRAFYATRDIIGDIRQAH
TRPNNNTRKSIHIGPGRSFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNDNTRKSIPMGPGKAFYATGDIIGNIRQAH
TRPNNNTRKSIHIGPGRAFYTTGSIIGDIRQAH
TRPNNNTRKGITIGPGRAFYATEKIIGDIRRAY
IRPNNNTRKSIPIGPGRAFYATGDIIGDIRKAH
TRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAY
TRPNDNTRKSIHIGPGRAFYTTGQIIGNIRQAH
TRPNNNTRKSIHMGPGSAFYATGDIIGNIRQAH
TRPNNNTRKSIPIGPGRAFFTTGDIIGDIRQAH
TRPNNNTRRSIHIGPGRAFYATGDIIGDIRQAH
TRPSNNTRKGIHIGPGGAFYTTGEIIGDIRQAH
TRPSNNTRKSIHIGPGRAFYAT-DIIGDIRQAH
TRPKNEIKRRIKIGPGRAFVATGT-VGDTRQAQ
TRPNNSIKRRIHIGPGRAFFATNT-VGDTRQAQ
TRPDNEIRRSLQVGPGRAFVAAGT— AGDTRQAQ
TRPGNNTRRSIHIGPGRAFFATGDITGDIRQAH
TRPhlNNTRKSITIGSGRAFHAIEKIIGNIRQAH
TRPSKTTRRRIHIGPGRAFYTTKQIAGDLROAH
TRPNNNTRXSIRIGPGRAFVTIG-KIGNMRQAH
TRPNNNTRKSIHIGPGKAFYATGEIIGDIRQAH
TRPNNNTRKSIHIGPGSAFYTTGDIIGDIRQAH
TRPNNNTRKRVTMGPGRVWYTTCEIIGNIKQAH
TRPNNNTRKGIHLGPGGTFYATGEIIGDIRQAH
IRPNNNTRKSINIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRRGIHIGLGRRFYT-RKIIGDIRQAH
TRPHNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPGNNTRRSIPIGPGKAFFTT-EIIGDIRQAH
TRPNNNTRKSIHIGLGRAFYTTGDIIGDIRQAH
TRPNNNTRXSIPIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGNIRQAH
TRPNNNTRRSIGIGPGRAIYATDRIVGNIRQAH
IRPNNNTRKSISIGPGRAFYATGEIIGNIRQAH
TRPNNNTRKGIHIGPGRAFYATERIIGNIRQAH
TRPNNNTRRGIHIGPGRAVYTTGKIIGDIRQAH
TRPSNNTRRSIHIGPGRAFYTTGQITGNIRQAH
TRPNNNTRKSIQIGPGRAFYTTGEIIGNIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKRMTLGPGKVFYTTGEIIGDIRQAH
IRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEVIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
IRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGDIIGNIRQAH
IRPNNNTRRSIPIGLGSAFYTT-EIIGDIRQAH
TRPNNNTRKSIHMGPGKTFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGQIIGDIRQAY
TRPNNNTRKSIPIGPGRAFYTTGEIIGDISQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
IRPGNNTRKSIPIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKGIRIGPGRAFIAATKIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYATEAIIGDIRKAY
TRPNNNTRKGIHIGPGKAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSINIGPGRAFYTTGGLIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGGAFYATGEIIGDIRQAH
TRPNNNTRRGIHIGPGRAFYTTGQIIGNIRQAH
TRPNNNTRKG IHIGPGRAFYATGDI IGD IPvQAH
ISPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKS IHIiGPGKAVYTTGEI IGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGNIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIXGDXRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSINIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRRSIPIGPGRAFYATGNIIGDIRQAH
TRPNNNTRKSINIGPGRAFYTTGEIIGDISQAH
TRPFNNTRKSIPIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRRSIHIGPGRAFYTTGGIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRIGIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSINTGPGRAFYTTGDIIGDIRQAH
TRPSNNTRKGIQIGPGRAFYTTGQITGDIRQAH
TRPNNNTRKGIHIGPGRAFYATGEIIGNXRQAH
TRPNNNTRKSITIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAK
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGNIRQAH
TRPNHNTRRGIHIGPGRAVYTTGEIIGNIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSINIGPGRAFFTTGKIIGDIRQAH
TRPSNNTRKXIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTSKGIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYATGEIIGDIRQAH
TRPGNNTSRGIHIGPGRAFYTTXKIIGDIRQAH
TRPNNNTRKSINIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIPMGPGRAFYTTGDIIGNIRQAH
TRPHNNTRKSIPIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEIIGNIRQAH
TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIXIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSINIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGQIIGDIRQAH
TRPNNNTRKGIHIGPGKAFYATGEIIGNIRQAY
TRPNNNTRKGIHIGPGSAFYATGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIVGDIRQAY
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGQIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTIKSIHIGPGRAFYTTGQIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTG?IIGDIRQAH
TRPNNNTRKSITIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRRSINIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAY
TRPNNNTRKSIHIGPGRAFYATGAIIGNIRQAH
TRPNNNTRKSIHLGPGQAWYATGEITGDIRQAH
TRPNNNTRKSIHLGQGQAWYATGEIIGDIRQAH
TRPNNNTRKSIHLGPGQAWYTTGQIIGDIRQAH
TRPNNNTRKSIPLGPGRAWYATGEIIGDIRQAH
TRPNNNTRKSIPLGPGQAWYTTGQIIGDIRQAH
TRPNNNTRKGIHLGPGQAWYTTGQIIGDIRQAH
TRPNNNTRKSIPLGPGQAWYTTGQIIGDIRQAH
TRPNNNTRKSIPLGPGQVWFTTGQIIGDIRQAH
TRPNNNTRKSIHLGPGQAWYTTGQIIGDIRQAH
TRPNNYTRKXIXMGPGRXXYTTGEIIGDIRRAH
TRPNNNTRKSIHLGPGRAWYTTGQIIGDIRQAH
TRPNNNTRKSIHLGPGRAWYTTGQIIGDIRQAH
TRPNNNTRKSIPLGPGQAWYTTGQIIGDIRQAH
TRPNNNTRKGIPIGPGRAFYTTGDIIGDIRQAH
TRPNNNTSKGIPIGPGRAFYATGXIIGDIRQAH
TRPNNNTPJCG I HIG PGRAFYTTGEI IGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGQIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEIVGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGGIIGDIRQAH
TRPNNNTRKSIHMGOGRAFYATGGIIGDIRQAY
TRPNNNTRKGIHLGPGQAWYTTGOIIGDIRQAH
TRPNNNTRKGIPLGPGQAWYTTGQIIGDIRQAQ
TRLNNNTRKSIAIGPGRTVYATDRIIGDIRQAH
TRPSKNIRRSIHIGSGRAFYTIEGVAGDVRKAY
TRPNNNTRRGIHIGPGRAFYATGNIIGDIRQAH
TRPSNNTRKSIHIGPGRVFHATGEIIGDIRQAH
TRPNNNTRKRIYIGPGRAVYTTEQIIGNIRQAH
TRPGNNTRERISIGPGRAFIAKGQIIGDIRQAH
TRPGNNTRKSIPIGPGRAFIATSQIIGDIRKAH
IRPNNNTRKGIGXGPGRTVYTAEKIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNIYRKGRIHIGPGRAFHTTRQIIENIRQAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNKKTRKRITTGPGRVYYTTGEIVGDIRQAH
TRPNNNTRKRITMGPGRVYYTTGQr IGDI RRAH
IRPNNNTRKGINVGPGRALYTTGDIIGDIRQAH
TRPNNHTRKRVTLGPGRVWYTTGEILGNIRQAH
TRPNNNTRKSITLGPGRAFYTTGDIIGDIRQAH
TRPNNNTRK1-IHIAPGRAFYTTGDIIGDIRKAH
TRPSNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPGNNTRKSIPMGPGRAFYATGDIIGDIRKAH
TRPNYNKRKRIHIGPGRAFYTTKNIIGTIRQAH
TRPNNNTRKGIAIGPGRTLYAREKIIGDIRQAH
TRPNNNTRRRLSIGPGRAFYARRNIIGDIRQAH
TRPNTKKIRHIHIGPGRAFYATGGIMGDIRQAH
TRPNNNTRRSINIGPGRAFYTTGDIIGDIRQAH
TRPNNNTSKRISIGPGRAFVAAREIIGDIRKAH
IRPNNNTRKSISIGPGRAFYTTGEIIGDIRQAH
TRPNNNTTRSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSITIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIYIGPGRAFHTTGRIIGDIRKAH
TRPNNNRRRRITSGPGKVLYTTGEIIGDIRKAY
IRPNNNTRKGIHIGPGKAFYTTGEIIGNIRQAH
TRPNNNTRKSINIGPGRALYTTGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYATGEIIGDIRQAH
TRPNNNTRRSIPMGPGKAFYTT-EIIGNIRQAH
TRPSNYTGKRLSIGPGRAFVATRKIIGDIRQAH
TRPGNNTRKSITMGPGKVFYA-GEIIGDIRQAH
TRPNNNTRKSIPMGPGRAFYTTGEIIGDIRKAY
VRPSNNTRQSIPIGPGKAFYATGEIIGDIRKAH
TRPNNNTRRSVHIGPGSALYTT-DIIGDIRQAH
IRPNNNTRRSINKGPGRAFYTTGDIIGDIRQAH
TRPNNNTRRS IH IG^GRAWYTTGK ITGDI RQAH
TRPNNNTRKR ITMGPGRVTjYTTGQ I IGDVRRAH
TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAY
TRPSNNTRKGIPIGPGRAFYTTGGIIGDIRQAH
TRPNNNTRKSIHIAPGRAFYATGGIIGDIRQAH
TRPNNNTRRSINMGPGRAFYTTGDIIGDIRQAH
TRPSNNTRKSITIGPGRAFYTTGEVIGDIRQAH
TRPNNNTRRGIHIGPGRAFYTTGEIIGDIRQAH
TR?NNNTRKS I P Tr ! PGRAF YATGDI IGDIRQAH
TRPNNNTRKS I HiGPGKAFDAT-DI IGDI RQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRKAH
TRPNNNTRKGIHMGPGRAFYTTGAIIGDIREAH
TRPNNNTRRSITIGPGRAFYAT-DIIGDIRQAH
TRLSNKTRRSIHIGPGRAFYAT-DIIGDIRQAH
TRPNNNTRRSIHIAPGRAFYATGDIIGDIRQAY
TRPNNNTSRRISIGPGRAFTAREGIIGDIRQAH
TRPNNNTRRSIHIGPGKAFYATGGIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGSAFYTTGDIIGHIRQAH
TRPNNNTGKSIHLAPGRGFHATGEIIGNIRQAH
TRPNNNTRKGIAIGPGRTVYATGRIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGGIIGEIRQAH
TRPNNNTRKGIPIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAH
SRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAH
TRPGNNTRRSIHIGPGRAFYTTGEIIGNIRLAH
TRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TR rNNNTRKGI HIG PGRAF YATGEI IGNI RQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNOTRKGIHIGPGRAFYTTGEVIGNIRQAH
TRPNNNTRKSXPMG PGKAMYATG E I IGD I-RKA Y
TRPNNNTRKSIHIGPGRAFYTTGEIVGDIRQAH
TRPNNNTRKSIHIGPGRAFYAT-DIIGDIRQAH
TRPNNNTRKSIPMGPGRAFYTTGEVIGNIRQAY
TRPNNNTRKSIHIGPGRAFHTTGEVIGDIRQAH
TRPNNNTRKSINIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSINIGPGRAFYTTGEIIGDIRQAH
IRPNNNTRRSIHMGPGRAFYATGDIIGDIRQAH
IRPNNNTRRSINIGPGRAFYTTGDIIGNIRQAH
TRPGNKTIRSISMGPGRAF-RTGQIIGNIRQAN
TRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAH
TRPNNNTRRSIHIAPGRAFHATGNIIGDIRQAH
TRPSNNTRKSVHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHLGPGRAFYATGEIIGDIRQAH
IRPNNNTRKSIHIGPGRAFYTTGDIIGDIRKAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGQIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGDIIGDIRKAH
TRPSNNTRRSIHMGLGRAFYTTGDIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGQIIGDIRKAH
TRPNNNTRRSIPIGPGRAFYTTGQIIGDIRQAH
IRPNNNTRKSITMGPGKVFYVT-DIIGDIRQAQ
TRPSNNTRKRIAIGPGRAVYTTEQIIGDIRRAH
ERPNNNTRKSINIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQAFYATGEIIGDIRQAH
TRPNNNTRKSISLGPGQAFYATGDIIGNIRQAH
TRPNNNTRESIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRQSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAY
TRPNNNTRKGVRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRRAY
TRPSNNTRKSIRIGPGQTFYATGEIIGDIRQAH
TRPNNNTRKSLRIGPGQTFYATGDIIGDIRRAH
TRPNNNTRKSTRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRRAY
TRPNNNTRKSIRIGPGQAFYATGDIIGDIRQAY
TRPNNNTRKSIRIGPGQAFYATNDIIGNIRQAH
TRPNNNTRQSIRIGPGQVFYATKDIIGDIRQAH
TRPTNNT 0 OSIRIG PGQAFFATKG 1 1 GD I RQAH
TRPNNNTKKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQAFYATGGIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAY
TRPNNNTRKSVRIGPGQTFYATGDIIGNIRQAH
TRPGNNTRKSMRIGPGQPFYATGDIIGNIRQAH
TRPNNNTRKSIRIGPGQAFYATNDIIGDIRQAH
TRPNNNTRKSMRIGPGQTFYATGDIIGNIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
VRPNNNTRKSIRIGPGQTFYATN**********
TRPNNNTRQSVRIGPGQAFYATKDIIGDIRQAH
TRPGNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRRSIRIGPGQVFYANNDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATNEIIGNIREAH
ARPNNNTRKSMRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
TRYANNTRKSVRIGPGQTFY-TNDIIGDIRQAH
ARPNNNTRESIRIGPGQTFYATGDIIGDIRQAY
TRPNNNTRKRIRVGPGQTVYATNAIIGDIRQAH
TRPSNNTRKSIRIGPGQAFYATGGIIGNIRQAH
ARPGNNTRKSIRIGPGQTFFATGAIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGNIRQAH
TRPYNNTRQRTHIGPGQALYTT-RIIGDIRQAH
TRPNNYKRQGTPIGLGQALYTT-RVIGDIRKAH
TRPNNNTRQGTHIGPGQALYTT-GVIGDIRKAH
TRPYNNTRQSTRIGPGQTLFTT-KI IGDIRQAH 

TRPYNNTRQGTHIGPGRAYYTT- Nil GDI RQAH 
TRPYNNTRQGTHIGPGQTLFTT-KI IGDIRQAH 
TR P YNNKRQRTP IGLGQVI*HTT-RVKGD IRQ AH 
TR P Y SRVRQGAH IG PGRAYYAT-NI FGD I RQAR 
TRPSNNTRQSTRIGPGQALYTN-KI IGNIRQAH 
ARPYNNTRQSTRIG PGQALFTS-KI IGNIRQAH 
TRP YENMRQRTPIGLGQALVTS -RI KGR I RPA Y 
TRPYNNTRQGTHIGPGRAYYTT-RILGNIRQAH 
TRP YNNT IQGTH IG PGRAYYTTISVIGDI RQAH 
TR PYNNTIQKTS IGRGQALYTT- ETRGDI KQAF 
TR PYNNIRQRT P IGSGQALYTT-RR IGD I RQAY 
TRP YNNTRQGTHIG PGRA YYTT- RIVGNIRQAH 
TRPYNNTRQSTHFGPGRAYYTT- DI IGDI RQAH 
TRPNNNTRQSTQIGPGQALFTKTRI IGDIRQAH 
TRPYENVRHRTPIGLGQALITN-RIKAKIGQAY 
TRP YNQ I RQRTS IGQGQAL YTT - RVTGDI RKA Y 
TRPYNNTRKGIHIGPGRAYYTT-NIVGNIRQAH 
TRPYDKVSYRTPIGVGRASYTT-RIKGDIRQAH 
TRPYNNIRQRTPIGLGQALYTT-RRIEDIRRAH 
IRPYNNTREGTHIGPGRALFTT-DI IGDIRQAH 
ARPYAIERQRTPIGQGQVLYTT-KKIGRIGQAH 
TRPNNNTRQSTHIGPGQAIYTLTKWGDIRQAH 
S RP YENKRRRTP IGLGQAYYTT-KLKG YI RPAH 
TRPEKIKRRGTPIGLGQAYLTT-QITGYIRQAH 
TRPYRNIRQRTHIGTGQAYYTK-GIKGVAGQPH 

IRPNKTKIQRTSIGLGQALYTNDKI IGNIRQAY 
AR P Y I KI WRRTH IGSGOAYSTK -R I QNYTGP A.H 
TRPKNITIQRTPIGLGQALYTT-KRIGVIGQAS 

SRPRNVT I QRTS I GSGQALYTT - KR1GY I KQAH 

TRPYHNKIQRTHIGTGQALHTT-RITGYIGQAH 

TRPYYNIRQRTPIGLGQALYTTRGTTKVIGQAH 

TRP YNKTSQRT S IGQGRALYTT-KPTGY IRQAY 

5RPYKSTRIRTKIGSGQAYYRT-NIQGDIRQAY 

TR P Y RAMRRRT S IGQGQA YYTTTG I GGN I RQAY 

TRPYSNKRQSTPIGLGQALYTT-RGRGDIRKAH 

ARP YEKKRRTTPIGLGQALITS -RNFEKIGQAH 

TRPYKS IRRIGPGRWQTYY — TTNITGRAH 

IRPNKRTRQRTHIGSGQALYTT-KIVGDIRQAH 

TRPDN I KRQRTP IGQGQAL YTTRLTTRRIGQPH 

MR PYNNKRQSVH IGPGRAFYTT- N I IGDIRQAH 

TRPYNNTRQGTHIGPGRAYWTT-NI IGDIRQAH 
TRP YNNTRQG IH IG PGRAYYTD-QITGDIRQAH 
TRPSNNTRKS I H IGPGQALFT I -DI IGNI RQAH 
TRPNNNTRQSTH IGPGQALYTT- K I IGDI RRAH 
TRP ANNTRQSVHliGPGQALYTT-RVI GDI RQAY 
TRPYNNIKIQTPIGRGQALFTT-RIKGIKGQAH 
TRPNNNTRQS IH IGPGQALYTT -NV IGDIRQAH 
TRPYTNKRQGTHMGPGRALYTI -DITGDIRQAY 
TRP YNNTRQSTH IG PGQALYTT -N I IGDIRQAH 
VRPYSNQRRRTP IGLGQALYTTMDNMKNIKQAY 
TRPYNNIKIQTPIGRGQALFTT-RRKGIKGQAH

HIV input: 9/10

TRPYTKTRH RAQGRAWWTTGI TGDIRQAY
ARPYENIRQRTPIGTGQALYTTKK-IGKIGQAH
TRPYSKERLKTSIGQGQALYTTVKVTGDIRQAH
ARPYQNTRQRTPIGLGQSLYTT-RSRSIIGQAH
TRPNKITRQSTPIGLGQALYTT-RIKGDIRQAY
TRPGNNTRRGIHFGPGQALYTT-GIVGDIRRAY
TRPYKYTRQRTSIGLRQSLYTIKKKTGYIGQAH
TRPYRNIRQRTSIGLGQALYTT-KTRSIIGQAY
IRPNNNTRQSTHIGPGQALYTT-KVIGDIRQAY
TRPNNNTRKSIHIGPGQAIYTT-DVIGDIRQAY
TRPNNNTRKGIHIGPGQALYTSGDIVGDIRQAH
TRPNNNVRQRTPICPGQAFYTTG**********
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPFKNMRTSARIGPGQVFYKTGSITGDIRKAY
TRPFKKVRISARIGPGRVFHTTGNINGDIRKAY
TRPFKRVRTSVRIGPGRVFHKTGAINGDIRKAY
TRPSNNTRTSVRIGPGQVFYXTGDIIGDIRRAY
TRPFKKTRISARIGPGRVFHKTGAILGDIRKAF
TRPSNNTRTSVRIGPGQVFYKTGEIIGDIRKAF
TRPSNKIRTSVRIGPGQVFYKTGAIMGDIRKAF
TRPSNNIRTSVRIGPGQVFYKTGSITGDIRKAF
TRPFKKMRTSVRIGPGRVFYKTGSITGDIRKAY
TRPYKNTRTSARIGPGQVFYKTGSITGDIRKAY
TRPSNNTRTSVRIGPGQVFYGTGEIIGDIRRAF
TRPSTTIRTSSRIGPGQAFYKIEGISGNIRAAY
TRPSNNTRTRITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQIFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTSITIGPGQVFYRTGDITGNIRKAY
TRPSNNTRTSIPIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTSITMGPGQVFYRTGDIIGDIRRAY
TRPSNNTRPSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYKTGDIIGNIRKAY
TRPSNNTRTSIPIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSIPIGPGQAFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTSTTIGPGQVFYRTGDIIGDIRKAY
TRPSNNTPTSITIGPGQVFYRTGDIIGDIXKAY
TRPSMNTXPSITXGPGQVFYRTGDIIGDIRXAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSINIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITVGPGQVFYRTGDITGDIRKAY
TRPSNNTRTSIPIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRQAY
TRPSNNTRTSINIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTGITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQIFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAH
TRPSNNTRTSLTIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSLTIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRRAY
TRPSNNTRTSINIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVLYKTGDIIGDIRKAY
TRPSNNTRTSTTIGPGQVFYRTGDITGNIRKAY
TRPSNNTRTSVRIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIQLGPGRAFYTTGEIIGDIRKAH
TRPNNYTRKSIYFGPGRAFHTAGKIIGDIRKAH
TRPNNNTRKGIHIGPGRAFYATGDIIGDIRKAH
TRPNNNIRKSIPLGPGRAFYATGEIIGDIRKAH

HIV input: 10/10

TRPSKTIRRRIRIGLGRVFYAT-GVNGDIRKAY
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRKAY

TRPNNNTRKSIRIGPGQVFYATGDIIGDIRKAY
TRPNNNTRKGIRIGPGRVIYATSAITGDIRQAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATDDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGXIIGNIRKAY
TRPNNNTRKGIHIGVGRPFYRTVDIVGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRRAY
TRPNNNTRKSIXLGPGQAFYTTGNIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGDIIGNIRKAH
TRPNNNTRKSIHLGPGQAFYATGNIIGDIRKAH
TRPNNNTRKSIHIGPGQAXYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGGIIGNIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGEVIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYTTGEIIGDIRKAH
TRPNNNTRKSITIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSISFGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGQALYATGAIIGDIRQAH
TRPNNNTRKSIKFGTGRVLYATGAIIGNIRQAH
TRPNNNTRKSIRIGPGQAFYATGEIIGDIRQAH
TRPNNNTRKSITLGPGQAFYATGDIIGNIRQAH
TRPNNNTRKSITFAPGQAFYATGDIIGNIRQAH
TRPNNNTRKSIPIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSISIGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSISIGPGQAFYATGDIIGDIRKAY
TRPNNNTRRSMRIGIGRGQTFHGAIIGDIRQAH
TRPNNNTRKSIKIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSINIGPGRAFYATGDIIGDIRQAY
TRPNNTRNIRTHIGSGQAIFTT-KVIGDIRKAY
TRPNNNTRTSIHLGPGRAFYATGDIIGDIRQAH
TRPGNTTRRSMRIGPGRTFYTI GDIRKAH
TRPNNNTRKSVRIGPGQTFYATGDKKGDIRQAK
TRPNNNIRKSIRIGPGQAFFATGDIIGNIRQAQ
TRPNNNTRKSIRFGPGQAFYT-SDIIGDIRQAY
TRPNNNTRRSIHVGPGQAFYATGDIIGNIRKAH
TRPSNNTRRSIRFGPGQAFY-TNDIIGDIRQAY
TRPGSDKKIRIRIGPGKVFYAKGGITG QAH
ERPGIDIQE-IRIGPMA-WYSMGLGGTSSRAAY
ERPQIDIQE-MRIGPMA-WYSMGIGGTSSRAAY
IREIAEVQD-IYTGPMR-WRSMLKRSNPRSRVA
ERPGNQTIQKIMAGPMA-WYSM--NTKRA--AY

APPENDIX C



HIV output: 1/6

0 | 0.00 |
1 | 0.00 |
2 | 0.01 |
3 | 0.01 |
4 | 0.01 |
5 | 0.01 |
6 | 0.02 |
7 | 0.02 |
8 | 0.02 |
9 | 0.02 |
10 | 0.03 |
11 | 0.03 |
12 | 0.03 |
13 | 0.03 |
14 | 0.04 |
15 | 0.04 |
16 | 0.04 |
17 | 0.04 |
18 | 0.05 |
19 | 0.05 |
20 | 0.05 |
21 | 0.05 |
22 | 0.06 |
23 | 0.06 |
24 | 0.06 |
25 | 0.06 |
26 | 0.07 |
27 | 0.07 |
28 | 0.07 |
29 | 0.07 |
30 | 0.08 |
31 | 0.08 |
32 | 0.08 |
33 | 0.09 |
34 | 0.09 |
35 | 0.09 |
36 | 0.09 |
37 | 0.10 |
38 | 0.10 |
39 | 0.10 |
40 | 0.10 |
41 | 0.11 |
42 | 0.11 |
43 | 0.11 |
44 | 0.11 |
45 | 0.12 |
46 | 0.12 |
47 | 0.12 |
48 | 0.12 |
49 | 0.13 |
50 | 0.13 |
51 | 0.13 |
52 | 0.13 |
53 | 0.14 |
54 | 0.14 |
55 | 0.14 |
56 | 0.14 |
57 | 0.15 |
58 | 0.15 |
59 | 0.15 |
60 | 0.15 |
61 | 0.16 |
62 | 0.16 |
63 | 0.16 |
64 | 0.16 |
65 | 0.17 |

A18|Q31|H33 S & 36019 4 15684.208314 4 0.000000 & \cr 
A18|T21 $ 
A21|D24 $ 
H12|A18 $ 
H12|R17 $ 
I11|R17 S 
L13|K31 $ 



0.000000 & 
0.000000 4 
0.000000 4 
0. 000000 4 
0.000000 4 
0.000000 4 



\cr 
\cr 
\cr 
\cr 
\cr 
\cr 



33816 4 123S2. 399254 
45549 & 17706.407140 4 
86025 & 24619.776947 4 
48257 4 19028. 783S92 4 
64548 4 27053.952336 4 
39382 4 17335.347894 4 
L13|W19|Q24 $ 4 20184 & 379.160544 4 0.000000 4 \cr 
M13(W15 S 4 23300 4 6673.177086 4 0.000000 4 \cr 
N4|K9 $ 4 162152 4 74737.922307 4 0.000000 4 \cr 
N4|K9|H33 $ 4 26376 & 5666.716129 4 0.000000 4 \cr 
86891 & 17162.233105 & 0.000000 & 
233190 & 186078.818611 4 0.000000 
53740 & 10564.956512 & 
18359.197022 
27136.429076 
10413.255892 
26805.242087 
16232.354294 
17415 .113746 
18975.126308 



017)024 
031|H33 
R12|Q17 
R12|T18 
R17|A18 
R17{E24 
R17|Q31 
R17)T21 
S10|D24 



V11|R12 $ 



& 
St 
& 
& 
& 
& 
4 



0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
& 0.000000 
0. 000002 4 



\cr 
& \cr 
\cr 
\cr 
\cr 
\cr 
\cr 
\cr 
\cr 
\cr 
4 \cr 
\cr 



62774 
54366 
33748 
45065 
4 70301 
4 57772 
4 39546 

V11|R12|T18 S 4 17628 4 881.251263 
K31|Y33 5 4 36346 4 20803.634880 4 

4 45441 4 30227.409858 4 0.000003 4 \cr 
4 25033 4 10875.740384 4 0.000018 4 \cr 
4 20779 4 7151.794446 4 0.000041 4 \cr 
4 40098 4 27695.038620 4 0.000231 4 \cr 
4 29121 4 16875.538795 4 0.000286 & \cr 
& 29621 & 18109.021417 & 0. 000737 & \cr 
& 22348 4 10939.327036 4 0.000839 4 \cr 
$ 4 15175 & 4159.316971 4 0.O0135S 4 \cr 
S4 |T9|T12 |V18|R21 $ 4 10919 4 1.718549 4 0.001524 4 \cr 
N4|K9|A21 $ 4 11233 & 623.181959 4 0.002185 4 \cr 

4 21868 4 11328.342993 4 0.002369 4 \cr 
44400 4 34516.144368 4 0.004910 4 \cr 

4 16593 4 6991.723713 4 0.006625 4 \cr 
16738 4 7234.038664 4 0.007331 4 \cr 
4 10844 & 1492.835945 4 0.008575 4 \cr 
4 13847 4 4587.312260 4 0.009408 4 \cr 

4 33735 4 24568.179150 4 0.010326 4 \cr 
4 23076 4 14893.617567 4 0.026158 4 \cr 
4 15497 4 7516.155896 4 0.031231 4 \cr 

N4|K9|Q31 |H33 $ 4 8280 4 493.681367 4 0.036905 4 \cr 
N4|K9(A18 $ 4 11655 4 4250.900600 4 0.050618 & \cr 
S4(T9(T12!V18(R21| Y33 S 4 7370 4 0. 093039 4 0.052029 4 \cr 
R12jQ17|Ti8 $ 4 7452 4 240.364918 4 0.058992 4 \cr 
V11|Q17 $ 4 14350 4 7329.962834 4 0.068429 & \cr 
& 23263 4 16324.923094 4 0.072825 4 \cr 
4 17288 4 10374.788061 & 0.074203 4 \cr 
4 15536 4 8921.243955 4 0.092437 4 \cr 
4 6529 4 136.997153 & 0.108375 4 \cr 

5 4 1022B 4 38S4. 612095 4 0.112708 4 \cr 
4 6573 4 275.512362 4 0.115524 4 \cr 

7265 4 1223.984346 4 0.137235 4 \cr 

4 6003 4 30.417827 4 0.143515 4 \cr 
4 6380 4 549.756091 4 0.157254 4 \cr 
6150 4 620.344848 4 0.189437 i \cr 
65S5 4 1027.737537 4 0.189642 4 \cr 
5751 4 247.598509 4 0.192378 4 \cr 
4 5514 4 35.313082 4 0.195240 4 \cr 
S4iT9|T12lVlS|R21 jK31 $ 4 5462 4 0.090571 4 0.197200 4 \cr 
H12[R17|A18 $ 4 5618 4 172.948903 4 0.199184 4 \cr 
Q9 |T11 (L19 1-23 $ 4 5464 4 38.188997 4 0.201464 4 \cr 
Y4|Q9 |T11( -23 $ 4 S364 4 35.276055 4 0.213243 4 \cr 
N4|A18|Q31|H33 $ 4 6378 4 1180.344841 fi. 0.229871 & \cr 
L3|N12|R23 S 4 5114 4 15.794611 4 0.243044 4 \cr 



N4|A21 $ 
Q17|K31 $ 
G10|H12 S 
K9|A21 $ 
F19|D24 S 
Q17|A21 $ 
H12|£24 $ 
N4|K9| 111 



N4fQ31|H33 $ 
F19|A21 $ 4 
K9|031|H33 $ 
W19|Q24 $ 4 
E1(N12 $ 
K9|E24 $ 
K9|K17 $ 
T12|V18 $ 
R12|A21 S 



H12|T21 $ 
017|Y33 S 
L13IW19 $ 
Sl7jK28 5 
N4 |K9 (Q31 

xejsi7 $ 

R17jQ31|H33 $ 4 
T9!T12|V18|R21 S 
N4)K9 |A18|H3 3 S 
S101F191D2 4 S 4 
IlljRl7|Al8 S 4 
V11(R12|Q17 S 4 
S4|T9|V18|R21 $

HIV output: 2/6

66 | 0.17 |
67 | 0.17 |
68 | 0.18 |
69 | 0.18 |
70 | 0.18 |
71 | 0.18 |
72 | 0.19 |
73 | 0.19 |
74 | 0.19 |
75 | 0.19 |
76 | 0.20 |
77 | 0.20 |
78 | 0.20 |
79 | 0.20 |
80 | 0.21 |
81 | 0.21 |
82 | 0.21 |
83 | 0.21 |
84 | 0.22 |
85 | 0.22 |
86 | 0.22 |
87 | 0.22 |
88 | 0.23 |
89 | 0.23 |
90 | 0.23 |
91 | 0.23 |
92 | 0.24 |
93 | 0.24 |
94 | 0.24 |
95 | 0.24 |
96 | 0.25 |
97 | 0.25 |
98 | 0.25 |
99 | 0.26 |
100 | 0.26 |
101 | 0.26 |
102 | 0.26 |
103 | 0.27 |
104 | 0.27 |
105 | 0.27 |
106 | 0.27 |
107 | 0.28 |
108 | 0.28 |
109 | 0.28 |
110 | 0.28 |
111 | 0.29 |
112 | 0.29 |
113 | 0.29 |
114 | 0.29 |
115 | 0.30 |
116 | 0.30 |
117 | 0.30 |
118 | 0.30 |
119 | 0.31 |
120 | 0.31 |
121 | 0.31 |
122 | 0.31 |
123 | 0.32 |
124 | 0.32 |
125 | 0.32 |
126 | 0.32 |
127 | 0.33 |
128 | 0.33 |
129 | 0.33 |
130 | 0.34 |
131 | 0.34 |



V13|V15|I19 S 4 5095 & 4.314940 & 
R12|Q17|D24 S 4 5088 4 122.811489 
S4|T9|V18 $ 4 5671 4 868.949180 & 
G24|E28 $ 4 5363 4 579.114112 4 0 
S4|T9|R21 $ 4 5425 & 650.238601 4 
K9|I11|R17 $ & 5315 & 590.615207 & 



0.244059 & Vcr 
4 0.261410 4 \cr 
0.285090 4 \cr 
237805 & Vcr 
0.289174 4 \cr 
0.296804 4 Vcr 



V18|K31|Y33 S 4 5524 & 852.751002 4 0.304979 4 Vcr 
T21|E24 $ & 19192 & 14557.811161 & 0.310756 4 Vcr 
S4|T9|T12|V18|R21|K31|Y33 $ 4 4390 & 0.004904 4 0.350351 & \cr 
S4|T9 |V18|R21|Y33 $ 4 4341 4 1.910712 4 0.358927 4 \cr 
I11(H12|A18 $ 4 5225 4 890.707158 4 0.359740 & Vcr 
4 $ H12|L13 $ 4 9314 4 5009.363342 & 0.364791 4 Vcr 
& $ M1|S12|F20 $ 4 4243 4 17.800459 & 0.378494 4 Vcr 
$ Y4|Tll|-23 $ & 4876 4 710.489341 4 0.388952 4 Vcr 
$ H12|A18|H33 $ 4 5292 4 1141.301814 & 0.391569 4 \cr 
$ N12|G24|L25 $ & 4169 4 18.987442 & 0.391690 4 Vcr 
$ N12|T13 $ & 5365 4 1255.021021 4 0.398803 & Vcr 
S N4|K9|G23 $ & 9804 4 5726.074196 & 0.404540 & Vcr 
$ P12|L13|W19|024 $ 4 4070 4 20.998880 4 0.409748 4 Vcr 
$ 012 1 Y13 |T15 |G17(V26 $ & 4024 4 0.000255 4 0.414274 4 Vcr 
S S10|F19|A21 $ 4 5598 4 1607.067572 4 0.420292 4 Vcr 
$ K9|H12 $ & 26788 & 22912, 753S61 4 0.441631 4 Vcr 
$ S10|Q17|D24 $ & 3960 & 93.803024 4 0.443318 & Vcr 
$ Q17|A21|D24 S 4 3949 4 133.098101 & 0.452738 & Vcr 
$ K4|K9|H12 $ 4 4239 4 450.472945 & 0.457896 4 Vcr 
$ T9|T12 |V18)R21 |Y33 $ 4 3784 & 1.646276 4 0.459063 & Vcr 
$ Y4|Q9|T11 $ & 4402 4 639.401728 & 0.462612 & Vcr 
3 N4|K9|R17 $ & 4239 4 507.820002 & 0.468770 4 Vcr 
$ N4|H12(A18 $ 4 4450 & 726.677198 & 0.470266 4 Vcr 
$ 09|T11|L19 $ 4 4413 & 691.708041 4 0.470653 & Vcr 
$ S4jT9|T12|R21 $ & 3747 & 31.482325 & 0.471755 4 Vcr 
$ N12|S30 $ & 4440 4 766.347625 & 0.479764 & Vcr 
$ Il|52 $ 4 3970 4 345.480880 4 0.489218 & Vcr 
$ S4 |T12 |V18|R21 $ & 3643 4 32.472859 & 0.491921 4 Vcr 
$ 09|Tll|-23 $ & 4299 & 742.828036 4 0.502461 4 Vcr 
T21|Q31 $ & 16089 & 12621-469597 4 0.519777 & \cr 
$ K9 |A18 |Q31 JH33 $ 4 4160 4 697.083962 & 0.520683 4 Vcr 
S S4|T9 |R21 |Y33 $ 4 3460 4 35.030271 & 0.528142 & Vcr 
$ Y4t09|Tll|L19|-23 $ 4 3425 4 1.824291 & 0.528495 4 Vcr 
S 36jK7|T10|Lll|M13|K16iG26lY28 S 4 3409 & 0.000000 6- 0 531288 4 \cr 
$ S4 |T9|V18 |R21|K31 S & 3406 & 1 . 8600S7 4 0.532246 4 Vcr 
S S17|X"!9 $ & 4910 4 1510. 151983 & 0.533093 & Vcr 
$ Y12|H20|R24 $ & 3401 4 29-556849 & 0.538702 4 V C r 
$ S4|T9 |V18 |R21|K31iY33 5 4 3370 & 0.100690 & 0.539008 4 Vcr 
S S10|Q17 $ £ 22065 & 18738.120311 4 0.547525 4 Vcr 
5 All-22|S23 $ 4 33C3 & 7.355264 4 0.553724 & Vcr 
$ H13|W15|E31 S & 3339 4 56.771417 & 0.556389 & Vcr 
$ *24 | *25 I *26| «27 i *28 t *29 | *30| *31 | *32 | »33 $ & 3269 4 
$ R17|H33 $ 4 31466 & 28229.156188 4 0.56S421 4 \cr 
$ K13|W15lT13 S 4 3501 &. 360.659791 f= 0-584679 4 \cr 
$ F13i-22|£23 S t 3122 & 6.681355 U C . 589480 f. Vcr 
3 R17|A18|T21 $ 4 3190 & 89. 245042 & 0.592592 & \cr 
S N4|K9|A18|031iH33 $ & 3143 4 55.455693 & 0.594235 & Vcr 
$ R17 | A18 |Q31|H33 $ & 3144 & 1C1. 027645 i 0-604153 4 Vcr 
$ VI |N23 | *24 | *25( *26| *27 | *28| *29 | *30 1 *31| *32 I *33 $ 4 3030 
S A11JN12 S 4 4517 4 1452.835945 4 0.607916 4 Vcr 
S R12$Tie|A21 3 4 3150 4 134.485293 4 0.609647 & \cr 
S S10|G23|D24 $ 4 3606 4 599.551395 4 0.611461 4 Vcr 
$ S1|M13|W15 $ 4 3087 4 91.193028 4 0.613590 4 Vcr 
S N12|F20|K24 $ 4 3202 4 213.735139 4 0.615099 4 Vcr 
S K13lW15jE24 S 4 3282 4 306.430052 4 0.617635 5. Vcr 
S K9 | 111 |F19 1G23 S 4 4153 4 1180.595212 4 0.618272 4 Vcr 
$ R2 | P3 jN5|N6lT7 |RB|G14 | P15 |G1G j Y20 (TZ2 |G23 | 125 | 126 |G27 | 129 1R30(A32 $ 
$ H12|A18|Q31 S 4 3759 4 845.163446 4 0.629981 4 >cr 
S Kl?|D20|-23 S 4 2928 4 25.438797 4 0.632234 4 Vcr 
. $ Y5|K7iR10|K23|N24|T28 $ 4 2897 4 0.000008 4 0.633345 4 Vcr 
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4 $ G10|R17 S 4 9506 4 6637.164691 4 0.638967 4 \cr 

4 S Y4|Q9|-23 $ 4 3539 4 699.676594 4 0.644852 4 Vcr 

4 $ G12|A22|D23|N24 $ & 2838 4 0.092735 4 0.645134 4 \cr 

4 $ Tl|R2|P3|NS|N6|T7|R8|G14|P15|G16|Y20|T22|l2S|I26|C27|I29|R30|A32 S 4 3787 & 
4 $ Tl|R2|P3|N5|N6|T7|R8tG14|P15|G16|Y20|T22|G23ll25|I26|G27tr29|R30|A32 S 4 3C 
& $ N4|H12 $ & 26775 & 23945.157075 & 0.646741 & \cr 

& $ V18 1 *24 j*25|*26|*27|*28|*29|*30|*31|*32f-33 5 & 2775 & 0.000000 & 0.657651 

& $ A1|R9|-22|S23 $ & 2763 & 0.413224 & 0.660115 & Vcr 

& $ Y6|Y12|F13|H20|A22 |K24 $ 4 2761 & 0.000052 4 0.660430 & \cr 

& $ V11|A24|S28 $ 4 2788 & 31.535138 4 0.661330 4 \cr 

& $ V20|.X22.|-24|K25|M29 $ 4 2748 4 0.000084 4 0.663009 4 Vcr 

& $ K9|H12|A18 $ 4 3267 & 526.343072 4 0.664465 4 Vcr 

& $ T9|T12'|V18jR2lj'K3'l $ 4 2742 4 i. 602621 4 0.664517 4 \cr 

& $ T8 j R9 $ & 318S 4 445.441758 4 0.664683 4 \cr 

& $ I11|H12|R17 $ 4 2909 4 172.776969 4 0.665344 4 Vcr 

& $ Y6|X10|X12|M13 |X1B |X19|R31 $ 4 2736 4 0.000000 4 0.665388 & Vcr 

& $ A24|S28 $ 4 3300 & 566.063943 4 0.665797 & Vcr 

& $ G12(T18|A22|D23 |N24 $ 4 2692 4 0.005083 4 0.674094 & Vcr 

& $ P12|W19|Q24 $ 4 3054 4 395.340658 4 0.680669 4 Vcr 

4 $ A14|H20|N24 $ 4 2697 4 47.434702 4 0.682460 4 Vcr 

4 $ T9 | Tl 2 | VI 8 | R21 1 K3 1 | Y3 3 $ 4 2632 4 0.086760 4 0.685931 4 Vcr 

4 $ R12|017|A21 $ 4 2701 4 79.045229 4 0.687837 4 \cr 

& $ R17|A18|H33 S 4 3944 4 1325.820339 4 0.688628 4 Vcr 

4 $ W15|I19|A24 $ 4 2655 4 56.335384 4 0.692552 4 Vcr 

4. S Q12|R13|V20|I22|K24 |-26|M29 $ 4 2584 4 0.000000 4 0.695324 4 Vcr 

4 $ SI | Y4 |-6|N10| Yll |S12|S15|V21|K24 $ 4 2554 4 0.000000 4 0.701181 & \cr 

$ T18|A21 $ 4 6883 4 4332.151205 4 0.701796 4 Vcr 

$ K17|D20 S 4 2996 4 4S8.83SS71 4 0.704460 4 Vcr 

$ Q17|D24|K31 $ 4 2660 4 125.180912 4 0.704916 4 Vcr 

$ L13|Q15|W19 $ 4 2582 & 98.222466 4 0.714812 4 Vcr 

$ S4|T9|R21fK31|Y33 $ 4 2474 4 1.844223 4 0.717056 4 Vcr 

$ Il|G4jMll|Pie|R22|-24|V25 S 4 2445 4 0.000002 4 0.722286 4 Vcr 

$ S12JF13 $ 4 4939 4 2S02 . 178252 4 0 .723857 4 Vcr 

S L13|Q17|K31 $ 4 2663 & 227.467572 4 0.724104 4 Vcr 

$ K9|R17|H33 $ 4 3142 4 710.504406 4 0.724879 4 Vcr 

$ P12|L13|W19 $ 4 2907 & 483.231131 & 0.726360 4 Vcr 

$ K9|R17|A18 $ 4 3012 4 59B. 308696 4 0.728290 4 \cr 

$ S4|T12|R21 $ 4 3010 4 597.264141 4 0.728473 4 Vcr 

$ N4|I11|R17 $ 4 3233 4 820.559839 4 0.728529 4 Vcr 

$ M13|A24lE31 $ 4 2426 4 50.435156 4 0.735563 4 Vcr 

$ L2 |A12|T18|V19|D23 |R24 $ 4 2374 & 0.0001C4 & 0.735861 4 \cr 

$ K9|A2l|H33 $ 4 3269 4 897.012220 4 0.735243 & Vcr 

$ R2|P3 |N5|N6iT7*Re |G14 | P1S1C16|F19|Y20|T22|G23|I25(I26|G27 | T29 |R30jA32 5 4 : 

$ RIO j Kll ( S12 | V2 5 $ 4 2345 4 0.448221 4 0.741446 4 Vcr 

$ N4 |K9|I11|G23 $ 4 2883 4 541.944923 4 0 742108 4 Vcr 

$ R17|A18|Q31 S 4 33G4 4 973.769536 4 0.744153 4 Vcr 

S Y4 |Q9 |T11 1 F13 | Y19 1 - 2 3 $ 4 2321 4 C.C09829 & 0.745895 4 Vcr 

$ I7|F20|Q33 S 4 2355 4 36.678004 4 0.74677$ i Vcr 

$ T9|V18|K31(Y33 S 4 2352 4 43.522103 4 0.748251 4 Vcr 

$ L3|A12|V19|D23|R24 $ 4 2307 4 0.001890 & 0.748529 & Vcr 

$ G4tMll|P18 $ 4 2306 4 12.419975 4 C. 751048 k Vcr 

$ S4 |T12 |V18|R21 |Y33 $ 4 2292 4 1.757250 & 0.751673 i Vcr 

S H12lR17|T21 S 4 2417 4 129.651999 4 0.752215 4 Vcr 

$ RIO j S12 1 W19 tQ2 4 $ 4 2299 4 14.238983 4 0.752700 & Vcr 

S O4|E6|I7|Lll|C12|V13;V20|A22lT24t-2StA26iT29jG-33 $ & 2279 & 0.C00000 i C.7* 
3 G101724 S 4 2727 4 449.008967 4 0.753966 4 Vcr 
$ V19 jR24iV26|L31 $ 4 2272 & 0.404088 4 C. 755161 4 Vcr 
$ Vll|R12|017|ri8 $ 4 2281 4 11.909386 b 0.755629 4 Vcr 
$ I,13|W15|Q24 |E28 $ 4 2270 4 0.994080 4 0.755644 4 Vcr 

T1|R2|P3|N5|N6!T7 |R8 |G14|P15|G16|F19 |Y20|T22|G23 1 125 \ 126 1 G27 ( 129 j R30 | A32 S 
R17|T21|E24 S 4 2366 4 123.762808 4 0.760627 4 Vcr 
4 2610 4 372.687253 4 G. 761541 4 Vcr 
4 2455 4 218.504333 4 0.761692 4 Vcr 
4 2336 4 100.386181 4 0.761856 4 Vcr 
4 3105 4 885.799386 4 0.764893 4 Vcr 
4 S Mll|I15jG18lOJ9|T20|F21|H22|A24 $ 4 2218 4 0.000000 U 0.765115 4 Vcr 



S 
$ 

S Ml3lW15|N28 $ 
$ M13(K17|V26 S 
$ Ml3jQ15iG24 S 
$ F19|G23|D24 $ 
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S R9|A14|H20|N24 $ 4 2214 4 2.664734 & 0.766345 & \cr 
5 S4 |T9|V18 |K31 $ 4 22S3 4 45.367685 & 0.767028 4 Vcr 
S Y4(Tll|L19|-23 $ 4 2222 4 36.549243 4 0.771106 4 Vcr 
S K9|A18|H33 S 4 9877 4 7701.775158 4 0.772980 4 Vcr 
$ T9|S18|H20 S 4 2246 4 73.652930 & 0.773506 4 Vcr 
S G10|S17|I19 $ 4 2252 4 85.340274 4 0.774546 4 Vcr 
S T12|F13|A14 $ & 2217 4 52.732117 & 0.774983 4 Vcr 
S N4|A21|H33 $ 4 3446 4 1316.842768 4 0.781367 4 \cr 
$ N9 |Rl0|S12|H2O|K23)O24 $ 4 2120 4 0.000164 4 0.783023 4 Vcr 
$ R12|T18|D24 $ 4 2373 4 258.082710 4 0.783941 4 \cr 
S T21|H33 S 4 13722 4 11609.274530 & 0.784336 4 Vcr 
$ T9|V18|R21 $ 4 2733 4 627.5011S1 4 0.785638 4 Vcr 
$ L13|K17|V19 S 4 2223 & 123.584647 & 0.786733 & Vcr 
$ T9|V18|Y33 $ 4 2928 4 837.332414 & 0.788304 4 Vcr 
4 $ E1|Q4|I5|D6|I7|08|E9|-10|M11|M16|A17|-18|W19|S21|M22|I24|G25|G26|T27|S28|S2 
4 $ Y5|K7|K23|N24|T28 S 4 2075 4 0.000148 & 0.791109 4 Vcr 
4 $ S4|W15|I19|A24 $ 4 2071 & 3.211998 4 0.792396 & Vcr 
4 $ N4|K9|A18|031 $ & 2414 4 350.433693 & 0.793149 4 Vcr 
S X12|L13|N24 S & 2091 & 32.654961 4 0.794078 4 Vcr 
$ S8|R10|S12|R20|-23|K24 $ 4 2056 4 0.001995 4 0.794496 4 Vcr 
$ S10|A21|D24 $ 4 2195 4 141.715389 & 0.794978 4 Vcr 
$ T5|K6|K7|I8|H10|G24|M26 $ 4 2049 4 0.000000 4 0.V95739 4 \cr 
S T9|V18|K31 S 4 2861 4 816.350692 4 0.796510 4 Vcr 
$ I20|A22 |T23|K24 S 4 2040 4 0.135518 4 0.797358 4 Vcr 
$ Y3 |A4|-21 |N23 $ 4 2039 4 0.001752 4 0.797511 4 Vcr 
$ G4 |W11|D23 |G24 $ 4 2039 4 0.335758 4 0.797570 4 Vcr 
$ I11|E24 $ 4 46:4 4 2601.997572 4 0.798748 4 Vcr 
S G10|G17|G24 $ 4 2157 4 138.303116 4 0.801095 4 Vcr 
$ Y6|S8|R10!A15|R16|K22(K24 $ 4 2011 4 0.000000 4 0.802448 4 Vcr 
S S4|T9|R21|K31 S ft 2043 4 34.105157 4 0.802818 4 Vcr 
4 $ D4|E6Jl7|R9|Lll|Q12|V13|V20|A22|T24|-25|A26|T29iQ33 S 4 1999 4 0.000000 4 
4 $ 09|T11|L19(-23|K24 $ 4 1990 4 1.569195 4 0.806400 4 Vcr 
4 S S8|P12|X24 $ 4 2000 4 16.301963 4 0.807225 4 Vcr 
4 $ S4|T9|T12|R21|Y33 $ 4 1985 4 1.703737 4 0.807295 4 Vcr 
4 S R10|Y12|V19|Q24(R31 $ 4 1982 4 0.043977 4 0.807529 4 Vcr 
4 $ T4 |Q9[F20|K23 |G24 $ 4 1979 4 0.004973 4 0.303044 4 Vcr 
4 $ L11|S12|L13|V26 S 4 1972 4 4.533048 fa 0.810047 4 Vcr 
4 5 T5|K6(K7l 18 1 H9 1 HlO | G24 | M26 S 4 1967 & 0.000000 & 0.81G12B & vcr 
4 $ S6fK.7|T10tLll|K16|G26lY28 S 4 1956 4 0.000000 4 0.812033 4 \cr 
4 $ T9(V13|R2\| Y33 S 4 1983 4 33.83957G 4 0.813214 4 Vcr 

4 $ R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|G23|I25|I26|G271D28II.29|R30|A32 S i 
4 S F19}A21(D24 3 4 2034 4 139.045764 4 0.813940 4 \cr 
S L1|M13|W15 $ 4 1949 4 9.905173 4 0.814948 4 Vcr 

$ Q9jTli;Q12jLl9lF20|K22jT23!R24 $ 4 1933 4 0.000000 4 0.815996 4 Vcr 
$ F19|A2I|G23 $ 4 4336 4 2404.^25279 4 0.816257 4 Vcr 
$ H12|R17|E24 $ 4 2006 4 91.386294 4 0.819143 4 \cr 
$ L13|WlStV26 $ 4 2149 4 237.129335 4 0.819611 4 Vcr 
$ N12lWl9|-23|N24 S 4 1909 4 7.808759 4 0.82X430 * W 
4 S Tl|R2|P3|NS|N6|T7|R8|G14|?l5|G15lY2C|T22;i25|G27|I29|R3C|A32 S 
4 S T21|024 5 i. 77^3 4 5882.829451 4 0.823300 4 Vcr 
4 S G4|V11|R12 $ 4 2497 4 605.660868 4 0.323G10 4 \cr 
4 $ Q17|K31|Y33 $ & 2149 4 263.756833 4 0.824134 £ \cr 
4 $ K25|K26 $ 4 2095 4 217.9C6699 4 0.825341 & Vcr 
4 $ T21',Q31|H33 $ 4 2236 4 361.582066 4 0.825961 4 \cr 
4 $ T12|V18|R21 S 4 2446 4 576.530816 4 0.626794 t Vcr 
4 $ Y4|Q9|TI1|L19|-23|N24 $ 4 1869 4 0.047981 4 0.826881 4 Vcr 
$ H12|Q?1|H33 5 4 2S22 i 1055.347497 4 0.827258 4 Vcr 

$ R9|Mll|H5|GlBiQl9lT20|F21|H22|A24 S 4 1365 4 0.000000 4. 0.827546 4 Vcr 
$ VllTiB|N23| '24 j *25| *26| *27| *28| *29l '301 *31',*32l *33 $ & 16SS & 0.0O00OC :. 
$ G4 |L13 |W151A24 $ 4 1866 4 7.795250 4 0.828S68 4 Vcr 
$ P12JT21 S 4 11256 4 9405.119546 4 0.829912 4 \cr 
S T9|K3ijY33 S 4 2569 4 323.122139 4 0.830747 4 \cr 
$ K7 (A14 i A24 |V3'^ 3 4 1B44 4 0.130705 4 0.831083 4 Vcr 

3 Q9 jTll I 119 | L22 1T23 | K24 |V2S|V26 $ 4 1841 4 0.000000 4 0.331561 4 \cr 
4 $ R9)K17|G24 S 4 2157 4 318.307130 4 0.831945 4 Vcr 
& $ A14|G24 $ & 5018 & 3183.960671 & 0.832719 & \cr
264 | 0.68 | & $ S8|R10|S12|V20|A22|R23 $ & 1834 & 0.000248 & 0.832726 & \cr
265 | 0.68 | & $ L11|S12|V26 $ & 1919 & 85.184287 & 0.832757 & \cr
266 | 0.69 | & $ R2|P3|N5|N6|T7|R8|S10|G14|P15|G16|F19|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $
267 | 0.69 | & $ R12|T18|R31 $ & 2336 & 510.426023 & 0.834125 & \cr
268 | 0.69 | & $ V11|R12|D24 $ & 2089 & 265.729908 & 0.834506 & \cr
269 | 0.69 | & $ H12|A18|T21 $ & 1899 & 83.244349 & 0.835749 & \cr
270 | 0.70 | & $ R12|D24 $ & 12816 & 11016.724531 & 0.838463 & \cr
271 | 0.70 | & $ D5|Q24|N28 $ & 1856 & 57.570055 & 0.838602 & \cr
272 | 0.70 | & $ V11|A21 $ & 6489 & 4695.336767 & 0.839384 & \cr
273 | 0.70 | & $ T9|T12|R21 $ & 2344 & 558.736361 & 0.840758 & \cr
274 | 0.71 | & $ G10|L13|W19|Q24 $ & 1805 & 21.432425 & 0.841035 & \cr
275 | 0.71 | & $ S4|X10 $ & 2670 & 889.423177 & 0.841523 & \cr
276 | 0.71 | & $ GS|L19|-23|N24|V25 $ & 1781 & 0.555631 & 0.841545 & \cr
277 | 0.71 | & $ N12|W19|N24 $ & 1920 & 140.804894 & 0.841748 & \cr
278 | 0.72 | & $ H4|R9|T12 $ & 1843 & 63.837063 & 0.841754 & \cr
279 | 0.72 | & $ R10|S12|S19|Q24 $ & 1775 & 2.690479 & 0.842869 & \cr
280 | 0.72 | & $ M13|K17|T18 $ & 2153 & 386.143038 & 0.843755 & \cr
281 | 0.72 | & $ R17|T21|Q31 $ & 1850 & 91.352915 & 0.845085 & \cr
282 | 0.73 | & $ S8|X24 $ & 2047 & 293.960536 & 0.845991 & \cr
283 | 0.73 | & $ Y4|Q9|R10|T11|L19|-23|R24 $ & 1749 & 0.003580 & 0.846644 & \cr
284 | 0.73 | & $ I1|E3|I4|A5|E6|V7|Q8|D9|-10|Y12|T13|M16|-18|W19|R20|S21|M22|L23|K24|R25|S26
285 | 0.73 | & $ D5|Q24 $ & 2759 & 1018.937453 & 0.848081 & \cr
286 | 0.74 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|I25|I26|G27|D28|I29|R30|A32 $ & 1'
287 | 0.74 | & $ K9|H12|R17 $ & 1894 & 166.406797 & 0.850079 & \cr
288 | 0.74 | & $ V11|Q17|D24 $ & 1812 & 93.103584 & 0.851467 & \cr
289 | 0.74 | & $ A1|M11 $ & 2729 & 1013.253017 & 0.851968 & \cr
290 | 0.75 | & $ H12|A18|Q31|H33 $ & 1795 & 85.511085 & 0.852963 & \cr
291 | 0.75 | & $ L19|S22|V26 $ & 1757 & 49.527891 & 0.853283 & \cr
292 | 0.75 | & $ R9|N12|M13 $ & 2146 & 444.904483 & 0.854292 & \cr
293 | 0.76 | & $ Q6|M13|W15|H20|N22 $ & 1695 & 0.0G5297 & 0.855256 & \cr
294 | 0.76 | & $ Y4|T11|L19 $ & 2355 & 661.700054 & 0.855524 & \cr
295 | 0.76 | & $ I19|-23|V25 $ & 1827 & 134.704350 & 0.855682 & \cr
296 | 0.76 | & $ T9|V18|R21|K31|Y33 $ & 1692 & 1.761938 & 0.856009 & \cr
297 | 0.77 | & $ S15|-21|-22|-24 $ & 1664 & 0.060460 & 0.860125 & \cr
298 | 0.77 | & $ X12|N24 $ & 2272 & 614.028472 & 0.861054 & \cr
299 | 0.77 | & $ A1|M11|T18 $ & 1713 & 55.464572 & 0.861122 & \cr
300 | 0.77 | & $ V20|I22|-24|K25|N28|H29 $ & 1657 & 0.000005 & 0.861205 & \cr
301 | 0.78 | & $ M9|N28 $ & 2C94 & 448.607779 & 0.863034 & \cr
302 | 0.78 | & $ A21|Q31|H33 $ & 3220 & 1574.859557 & 0.863042 & \cr
303 | 0.78 | & $ Q12|R13|V20|I22|K24|-26|N28|M29 $ & 1645 & 0.000000 & 0.863064 & \cr
304 | 0.78 | & $ L3|T9 $ & 3676 & 2051.224C46 & 0.863083 & \cr
305 | 0.79 | & $ D24|K31 $ & 12967 & 11324.565776 & 0.863460 & \cr
306 | 0.79 | & $ L13|K31|Y33 $ & 2465 & 827.871734 & 0.864278 & \cr
307 | 0.79 | & $ L19|T21|-23 $ & 1949 & 312.399933 & 0.864436 & \cr
308 | 0.79 | & $ G4|E?|R10|S12|Z20|R22|Q24 $ & 1633 & 0.000021 & 0.864913 & \cr
309 | 0.80 | & $ G4|W15|A24 $ & 1765 & 133.J49343 & 0.865121 & \cr
310 | 0.80 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|G23|I25|I26|G27|D28|I29|R30|A32 $
311 | 0.80 | & $ N4|R17|A18 $ & 2464 & 833.71808C & 0.865331 & \cr
312 | 0.80 | & $ Y4|T5|K6|H9|-10|-11|-12|-13|R14|A15|C16|G17|R18|A19|V20|W21|T23|G24|T26
313 | 0.81 | & $ M13|S17|I19 $ & 16?7 & 69.061642 & 0.865691 & \cr
314 | 0.81 | & $ I13|R17|Q31 $ & 285'? & 1234.2C8538 & 0.866479 & \cr
315 | 0.81 | & $ S1|V11|V25 $ & 1725 & 114.913066 & 0.868418 & \cr
316 | 0.81 | & $ Q9|L19|-23 $ & 236^ & 71-8.375850 & 0.868641 & \cr
317 | 0.82 | & $ A12|V19|R24 $ & 1625 & 17.183947 & 0.868764 & \cr
318 | 0.82 | & $ X8|P9|X13|X31 $ & 1606 & 0.000107 & 0.869040 & \cr
319 | 0.82 | & $ P12|A14|S17|W19|F20 $ & 16C5 & G.078275 & 0.869203 & \cr
320 | 0.82 | & $ E1|KS|T12|D20|Y22|G24 $ & 1602 & 0.000007 & 0.869647 & \cr
321 | 0.83 | & $ T11|L19|-23 $ & 2366 & 770.136365 & 0.870576 & \cr
322 | 0.83 | & $ I1|R9|I12 $ & 1615 & 19.398937 & 0.870616 & \cr
323 | 0.83 | & $ H3|R12 $ & 2021 & 425.578936 & 0.870643 & \cr
324 | 0.84 | & $ I11I12 $ & 1939 & 345.480880 & 0.870930 & \cr
325 | 0.84 | & $ R9|S15|-21|-22|-23|-24 $ & 1592 & 0.000188 & 0.871160 & \cr
326 | 0.84 | & $ A12|T18|V19|R24 $ & 1583 & 0.942061 & 0.872657 & \cr
327 | 0.84 | & $ W15|Q24|E28 $ & 1596 & 18.677870 & 0.873368 & \cr
328 | 0.85 | & $ Y12|V19|Q24|R31 $ & 1578 & 0.826255 & 0.873390 & \cr
329 | 0.85 | & $ E1|G4|I5|D6|I7|Q8|E9|-10|M16|A17|-18|W19|S21|M22|L24|G25|G26|T27|S28|S29|A3
330 | 0.85 | & $ G8|L13|A14|G18|H20 $ & 1575 & 0.000857 & 0.873716 & \cr
331 | 0.85 | & $ M11|R12|T18 $ & 1784 & 214.819937 & 0.874586 & \cr
332 | 0.86 | & $ Y4|I8|Q9|T11|Y19|I23|S24|V25 $ & 1569 & 0.000000 & 0.874613 & \cr
333 | 0.86 | & $ S4|K7|T9|A14|A24|V32 $ & 1568 & 0.000391 & 0.874763 & \cr
334 | 0.86 | & $ G10|V19|R24|V26|L31 $ & 1567 & 0.022878 & 0.874915 & \cr
335 | 0.86 | & $ A18|T21|H33 $ & 1950 & 386.091755 & 0.875373 & \cr
336 | 0.87 | & $ T9|V18|R21|K31 $ & 1595 & 32.945102 & 0.875649 & \cr
337 | 0.87 | & $ N12|N28|E31 $ & 1644 & 84.308230 & 0.876001 & \cr
338 | 0.87 | & $ N4|K9|F19|G23 $ & 2413 & 853.445657 & 0.876021 & \cr
339 | 0.87 | & $ S15|-21|-22|-23|-24 $ & 1550 & 0.003354 & 0.877439 & \cr
340 | 0.88 | & $ V15|P18|R21|V23|V26 $ & 1543 & 0.000396 & 0.878473 & \cr
341 | 0.88 | & $ V11|R12|A21 $ & 1677 & 139.056915 & 0.879218 & \cr
342 | 0.88 | & $ S4|K31|Y33 $ & 2429 & 891.874986 & 0.879339 & \cr
343 | 0.88 | & $ Y4|Q9|T11|L19 $ & 1566 & 32.897928 & 0.879930 & \cr
344 | 0.89 | & $ A14|S17|W19|F20 $ & 1534 & 1.410979 & 0.880005 & \cr
345 | 0.89 | & $ G4|R9|F20|T26 $ & 1540 & 12.834655 & 0.880800 & \cr
346 | 0.89 | & $ Y6|X10|X12|X18|X19|R31 $ & 1525 & 0.000000 & 0.881117 & \cr
347 | 0.89 | & $ L1|S4|M13|W15 $ & 1525 & 0.559824 & 0.881199 & \cr
348 | 0.90 | & $ N4|K9|A21|H33 $ & 1568 & 48.227041 & 0.881881 & \cr
349 | 0.90 | & $ T9|V11|R22 $ & 1769 & 253.858213 & 0.882556 & \cr
350 | 0.90 | & $ Y6|G8|R10|L11|S12|V20|R23|K24 $ & 1515 & 0.000000 & 0.882576 & \cr
351 | 0.90 | & $ X4|K31 $ & 1926 & 418.267274 & 0.883632 & \cr
352 | 0.91 | & $ P12|D23|N24 $ & 1623 & 115.955006 & 0.883732 & \cr
353 | 0.91 | & $ Q9|K23 $ & 4487 & 2986.952760 & 0.884744 & \cr
354 | 0.91 | & $ G4|R9|M13|W15 $ & 1511 & 14.154263 & 0.885206 & \cr
355 | 0.91 | & $ R2|P3|N5|N6|T7|R8|I13|G14|P15|G16|F19|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $
356 | 0.92 | & $ V13|W15|L19 $ & 1573 & 83.913345 & 0.886323 & \cr
357 | 0.92 | & $ P12|S30 $ & 2786 & 1298.604637 & 0.886566 & \cr
358 | 0.92 | & $ V1|R12|T18|N23|*24|*25|*26|*27|*28|*29|*30|*31|*32|*33 $ & 1487 & C.00000G
359 | 0.93 | & $ Q17|D24|Y33 $ & 1608 & 121.315232 & 0.886668 & \cr
360 | 0.93 | & $ E9|R12|T18 $ & 1614 & 133.600703 & 0.887568 & \cr
361 | 0.93 | & $ G4|R12|T18 $ & 2078 & 597.832556 & 0.887601 & \cr
362 | 0.93 | & $ H4|P12 $ & 2777 & 1298.604637 & 0.887855 & \cr
363 | 0.94 | & $ T1|R2|P3|N5|N6|T7|R8|G14|G16|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $ & 1925 &
364 | 0.94 | & $ S4|T9|N12|V18|R21 $ & 1474 & 1.158008 & 0.888647 & \cr
365 | 0.94 | & $ W19|X24|T26 $ & 1724 & 252.469231 & 0.888834 & \cr
366 | 0.94 | & $ A1|E9 $ & 2089 & 630.489943 & 0.890631 & \cr
367 | 0.95 | & $ A1|G4|F20|A24 $ & 1455 & 2.044033 & 0.891465 & \cr
368 | 0.95 | & $ T9|C12|-22|G24 $ & 1450 & 0.214607 & 0.891912 & \cr
369 | 0.95 | & $ Y4|Q9|T11|Y19|W20|-23|N24 $ & 1447 & 0.000061 & 0.892*04 & \cr
370 | 0.95 | & $ C10|M13|A24|E31 $ & 1446 & 2.555158 & 0.892845 & \cr
371 | 0.96 | & $ S10|D24|I26 $ & 2229 & 789.361165 & 0.893336 & \cr
372 | 0.96 | & $ G4|M13|W15 $ & 1691 & 252.133281 & 0.893444 & \cr
373 | 0.96 | & $ N12|E31 $ & 2929 & 1492.835945 & 0.893922 & \cr
374 | 0.96 | & $ T12|F13|A14|N28 $ & 1436 & 2.983773 & 0.894262 & \cr
375 | 0.97 | & $ S4|T12|V18|R21|K31 $ & 1434 & 1.7J0658 & 0.894363 &
376 | 0.97 | & $ S8|G10|X24 $ & 1444 & 16.637502 & 0.895049 & \cr
377 | 0.97 | & $ Q9|T11|L19|-23|K24|R31 $ & 1427 & 0.051495 & 0.8951*6 & \cr
378 | 0.97 | & $ M13|S17|N28 $ & 1661 & 239.34^729 & 0.895842 & \cr
379 | 0.98 | & $ R10|Y12|V19|E23|Q24 $ & 1420 & 0.021913 & 0.8960*M & \cr
380 | 0.98 | & $ R12|I13 $ & 2328 & 905.2G3714 & 0.896110 & \cr
381 | 0.98 | & $ G10|I20|A22|T23|K24 $ & 1415 & 0.007673 & 0.896763 & \cr
382 | 0.98 | & $ Q9|V18|K23 $ & 1572 & 162.149493 & 0.897472 & \cr
383 | 0.99 | & $ K17|T21|H33 $ & 1486 & 82.159457 & 0.898299 & \cr
384 | 0.99 | & $ T9|S18|K30 $ & 1425 & 25.486414 & 0.898892 & \cr
385 | 0.99 | & $ G8|A14|G18|H20 $ & 1399 & 0.016110 & 0.898954 & \cr
386 | 0.99 | & $ T12|S15|H20|I22|E23|X24 $ & 1393 & 0.000820 & 0.09979? & \cr
387 I 1.00 I 4 S II jY4{ -22|S23|R24 $ 4 1393 4 0.040715 4 0.899788 4 \rr 
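
# File: probsort.pl

$fm = $ARGV[0];

# Read the input file named on the command line.
open (IN, $fm);
@prob = <IN>;
chop @prob;
close (IN);

# Keep only the table rows, i.e. lines that contain "\cr".
@prob = grep (/cr/, @prob);

open (TEMP, "> probsort.temp");

foreach (@prob)
{
    print TEMP "$_\n";
}

close (TEMP);

# exit;

$fm = $fm . '.prob';

# print "fm: $fm\n";

# Sort numerically on the tenth whitespace-separated field (the last numeric
# column of each row). Part of the option string between "-n" and "+9" is
# illegible in the source and is omitted. "+9" is the historical key syntax;
# current sort implementations write it as "-k 10".
`sort -o prob.tmp -n +9 probsort.temp`;
`rm probsort.temp`;

open (IN, "prob.tmp");
@m2 = <IN>;
chop @m2;
close (IN);

# `rm prob.tmp`;

open (TEMP, "> $fm");

# Prefix each sorted row with its rank and its rank as a fraction of the total.
$total = scalar @m2;
$i = 0;
foreach (@m2)
{
    printf TEMP "%3d | %.2f | %s\n", $i, ($i / $total), $_;
    $i++;
}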

A[GAP43 RATGAP43]
D[nAChRd RNZCRD1]
A[PTN RATHBGAM]
D[Ins1 RNINS1]
B[cjun RNRJG9]
A[CCO1 RATMTCYTOC]
A[DD63.2 (I I)]

A[GAP43 RATGAP43]
B[nAChRa4 RATNARAA]
A[CCO2 RATMTCYTOC]

A[GAP43 RATGAP43]
C[mGluR2 RATMGLURB]
B[FGFR RATFGFR1]
B[SOD RNSODR]

B[NMDA2D RNU08260]
B[EGF RATEPGF]

B[G67I80/86 RATGAD67]

A[MAP2 RATMAP2]
A[synaptophysin RNSYN]
B[ChAT (*)]
A[GRa2 (|)]
A[GRb3 RATGARB3]
C[mGluR8 MMU17252]
D[nAChRa6 RATNARA6S]
A[trkB RATTRKB1]
A[PTN RATHBGAM]
B[IGF II RATGFI2]
A[H2AZ RATHIS2AZ]
A[TCP (I I)]

A[CCO2 RATMTCYTOC]
A[DD63.2 (I I)]

D[GFAP RNU03700]
A[NT3 RATHDNFNT]
C[PDGFb RNPDGFBCP]
C[cfos RNCFOSR]

F[cellubrevin s63830]
B[InsR RATINSAB]

A[GAD67 RATGAD67]
B[5HT2 RATSR5HT2]

F[cellubrevin s63830]
B[InsR RATINSAB]

A[nestin RATNESTIN]
B[CNTF RNCNTF]



A[ODC RATODC]
D[nAChRe RNACRE]
B[FGFR RATFGFR1]
A[cyclin A RATPCNA]
A[TCP (I I)]

A[CCO2 RATMTCYTOC]



A[GAD65 RATGAD65]
B[FGFR RATFGFR1]



F[NFM RATNFM]
B[nAChRa4 RATNARAA]
A[IGF I RATIGFIA]
A[CCO2 RATMTCYTOC]

D[nAChRe RNACRE]
B[TGFR RATTGFBIIR]

A[SOD RNSODR]

A[GAP43 RATGAP43]
A[neno RATENONS]
A[ODC RATODC]
A[GRa3 RNGABAA]
A[GRg3 RATGABAA]
B[NMDA2B RATNMDA2B]
D[nAChRd RNZCRD1]
A[CNTFR S54212]
B[FGFR RATFGFR1]
A[IP3R2 RNITPR2R]
B[cjun RNRJG9]
F[actin RNAC01]
A[SC1 RNU19135]



D[GRb2 RATGARB2]
B[CNTF RNCNTF]
B[PDGFR RNPDGFRBE]



D[G67I86 RATGAD67]



C[mGluR6 RATMGLUR6.]
B[Ins2 RNINS2]

D[mGluR6 RATMGLUR6.]
A[SC2 RNU19136]

B[TH RATTOHA]
B[EGF RATEPGF]



B[nAChRa4 RATNARAA] 0.00000
A[CNTFR S54212]
B[TGFR RATTGFBIIR]
A[H2AZ RATHIS2AZ]
F[actin RNAC01]
A[SC1 RNU19135]



A[GRg2 (#)]
B[cjun RNRJG9]



0.00000



A[G67I80/86 RATGAD67] 0.00000
B[nAChRa5 RATNACHRR]
B[cjun RNRJG9]



C[mAChR4 RATACHRMD] 0.00000



B[SC7 RNU19141] 0.00000

B[L1 S55536] 0.00000
F[GAT1 RATGABAT]
B[NOS RRBNOS]
A[GRa5 (#)]

B[mGluR3 RATMGLURC]
B[nAChRa4 RATNARAA]
B[5HT1b RAT5HT1BR]
A[MK2 MUSMK]
D[Ins1 RNINS1]
A[cyclin A RATPCNA]
B[Brm (I I)]

A[CCO1 RATMTCYTOC]
D[SC6 RNU19140]



D[NMDA2C RATNMCA2C] 0.00000
D[bFGF RNFGFT]
A[cyclin B RATCYCLNB]



B[IGF I RATIGFIA]



0.00000



C[mAChR3 RATACHRMB] 0.00000



D[5HT3 MOUSE5HT3] 0.00000



C[mAChR4 RATACHRMD] 0.00000

A[nestin RATNESTIN]
A[MK2 MUSMK]

A[ODC RATODC]
C[NGF RNNGFB]
A[MK2 MUSMK]
D[Ins1 RNINS1]
A[H2AZ RATHIS2AZ]
F[actin RNAC01]
A[DD63.2 (I I)]

A[GAP43 RATGAP43]
B[nAChRa4 RATNARAA]
B[FGFR RATFGFR1]

B[TH RATTOHA]
B[Brm (I I)]

D[mGluR1 RATGPCGR]
A[EGFR RATEGFR]

F[NFL RATNFL]
D[nAChRa2 RATNNAR]
A[SC2 RNU19136]

D[MOG RATMOG]
D[mGluR4 RATMGLUR4B]
A[IGFR2 MMU04710]

A[GAP43 RATGAP43]
B[nAChRa5 RATNACHRR]
B[cjun RNRJG9]

A[cellubrevin s63830]
A[CRAF RATRAFA]

B[keratin RNKER19]
B[CNTF RNCNTF]



B[TH RATTOHA]
B[IGF II RATGFI2]

D[nAChRd RNZCRD1]
D[trk RATTRKPREC]
A[PTN RATHBGAM]
B[IGF II RATGFI2]
B[Brm (I I)]

A[CCO1 RATMTCYTOC]



F[NFM RATNFM]
B[nAChRa5 RATNACHRR]
B[cjun RNRJG9]

A[MK2 MUSMK]



D[mGluR4 RATMGLUR4B]
A[IGFR1 RATIGFI]

D[mGluR4 RATMGLUR4B]
D[5HT3 MOUSE5HT3]



B[GRa1 (#)]

D[nAChRa2 RATNNAR]
C[IP3R3 RATIP3R3X]

F[NFM RATNFM]
B[FGFR RATFGFR1]
A[CCO2 RATMTCYTOC]

A[GRb1 RATGARB1]
B[IP3R1 RATI145TR]

A[cellubrevin s63830]
A[IGF I RATIGFIA]



C[NGF RNNGFB] 0.00000
B[Brm (I I)]

D[nAChRe RNACRE] 0.00000

A[CNTFR S54212]

B[TGFR RATTGFBIIR]

A[cyclin A RATPCNA]

A[TCP (I I)]

A[SC1 RNU19135]



C[mGluR2 RATMGLURB] 0.00000
B[trkC RATTRKCN3]
A[CCO2 RATMTCYTOC]



B[IGF II RATGFI2]



0.00000



D[nAChRa2 RATNNAR] 0.00000
A[IGFR2 MMU04710]

D[mGluR6 RATMGLUR6.] 0.00000
A[IGFR1 RATIGFI]



D[mGluR1 RATGPCGR] 0.00000
A[EGFR RATEGFR]



B[nAChRa4 RATNARAA] 0.00000
A[IGF I RATIGFIA]



A[IGF I RATIGFIA] 0.00000



B[TH RATTOHA] 0.00000
A[InsR RATINSAB]

CLAIMS

1. A coincidence detection method for use with a data set having a number of attributes, the
method comprising the steps of:

representing a set of M objects in terms of a number N_A of variables
("attributes"), where an attribute is said to occur in an object if the object
possesses the attribute;

sampling a subset of r_i out of the M objects, for each iteration among a
predetermined number of iterations;

detecting and recording coincidences among sets of k of the attributes in each
sampled subset of objects, a coincidence being the co-occurrence of 1 ≤ k ≤
N_A attributes in the same h_i out of r_i objects in the sampled subset, where 0 ≤
h_i ≤ r_i;

determining an expected count of coincidences for any set of k attributes and
a predetermined number of iterations of sampling and coincidence-counting
as described above, the determining being performed before sampling and
collecting, at the same time or after sampling and collecting;

comparing, for any set of k attributes and number of iterations of sampling
and coincidence-counting, the observed count versus the expected count of
coincidences, and from this comparison determining a measure of correlation
(or association, or dependence) for the set of k attributes; and

reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a set of k of the N_A attributes which have been
determined by this process to have a value for a chosen correlation measure
above a predetermined threshold value.
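
As an illustration of the "expected count" step of claim 1, the following Python sketch computes an expected coincidence count under two assumptions that the claim itself does not state: the attributes are statistically independent, and the sampled objects can be treated as independent draws (for example, M much larger than r_i). The function names, the argument `marginals` (the fraction of the M objects possessing each attribute) and the observed-to-expected ratio are hypothetical choices made for this example only.

```python
from math import comb, prod


def expected_coincidences(marginals, r, h, iterations):
    """Expected number of iterations in which a set of attributes co-occurs
    in exactly h of the r sampled objects, assuming independence."""
    # Probability that a single object possesses every attribute in the set.
    p_all = prod(marginals)
    # Binomial probability that exactly h of the r sampled objects do so.
    p_exactly_h = comb(r, h) * p_all**h * (1.0 - p_all) ** (r - h)
    return iterations * p_exactly_h


def correlation_ratio(observed, expected):
    """Observed-to-expected ratio; values well above 1 suggest correlation."""
    return observed / expected
```

For example, two attributes that each occur in half of the objects are expected, under these assumptions, to coincide in both objects of a sample of two in 62.5 out of 1000 iterations; an observed count of 125 would then give a ratio of 2.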

2. A coincidence detection method for use with a data set of objects having a number of
attributes, the method comprising the steps of: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of attribute
values in one or more objects in a sampled subset of the data set, where the
plurality of attribute values is the same for each occurrence, the detecting and
recording counts of coincidences in each sampled subset of the data set being
performed before, at the same time or after sampling, detecting and recording
counts of coincidences in other subsets;

determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording;

• comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of
attributes for the coincidence; and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of
correlation is above a respective pre-determined threshold.

3. The coincidence detection method of claim 2, wherein the comparison of observed
and expected counts is calculated using a Chernoff bound on tail probabilities.
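
One common form of the Chernoff bound on the upper tail can be sketched as follows. The claim does not say which variant is used, so this Python example assumes the standard multiplicative bound for a sum X of independent 0/1 trials with mean mu, namely P(X >= (1 + d) mu) <= (e^d / (1 + d)^(1 + d))^mu. The function name and its arguments are hypothetical.

```python
import math


def chernoff_upper_tail(expected, observed):
    """Upper bound on P(X >= observed) for a sum X of independent 0/1
    trials whose mean is `expected` (multiplicative Chernoff bound)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed <= expected:
        return 1.0  # the bound carries no information at or below the mean
    delta = observed / expected - 1.0
    # Work in log space: mu * (delta - (1 + delta) * ln(1 + delta)).
    log_bound = expected * (delta - (1.0 + delta) * math.log1p(delta))
    return math.exp(log_bound)
```

A small bound means the observed count would be unlikely if the attributes were independent, so the bound (or its negative logarithm) could serve as the measure of correlation that is compared against the threshold.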



4. The coincidence detection method of claim 2, wherein the counts are recorded by
storing a running total of the count of each coincidence over all of the sampled 
subsets. 
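
A running total of this kind can be sketched in Python with a `collections.Counter`. Representing each coincidence as a `frozenset` of attribute labels, and the function name `record_subset_counts`, are assumptions made for this example rather than details taken from the claim.

```python
from collections import Counter


def record_subset_counts(totals, subset_coincidences):
    """Add the coincidences found in one sampled subset to the running totals."""
    # Counter.update increments the count of every key in the iterable.
    totals.update(subset_coincidences)
    return totals


totals = Counter()
record_subset_counts(totals, [frozenset({"A18", "Q31"}), frozenset({"H33"})])
record_subset_counts(totals, [frozenset({"Q31", "A18"})])
```

Because only one total per coincidence is kept, memory use grows with the number of distinct coincidences seen and not with the number of sampled subsets.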



5. A method for visual exploration of a data set of objects having a number of 
attributes, the method comprising the steps of: 
• sampling a subset of the data set for a predetermined number of iterations,
each iteration the sampled subset of the data set having for each object the
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

reporting a set of k-tuples of correlated attributes to a user through a 
graphical interface, where a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above a respective pre- 
determined threshold. 

6. A pre-processing method for use with a data modelling unit to capture and report to 
the data modelling unit higher order interactions of a data set of objects having a 
number of attributes, the method comprising the steps of: 

• sampling a subset of the data set for a predetermined number of iterations,
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset, where the plurality of 
attribute values is the same for each occurrence, the detecting and recording 
counts of coincidences in each sampled subset being performed before, at the 
same time or after sampling, detecting and recording counts of coincidences 
in other subsets; 

determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording;

comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of
attributes for the coincidence; and

• reporting to the data modelling unit a set of k-tuples of correlated attributes,
where a k-tuple of correlated attributes is a plurality of attributes for which
the measure of correlation is above a respective pre-determined threshold.

7. A correlation elimination method for use with a data set of objects having a number 
of attributes, the method comprising the steps of: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets; 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

eliminating a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

8. The method of claim 2, wherein the objects are sales transactions, each transaction
comprising one or more purchased products, and the attributes are instances of sale
of particular products or types of products.

9. The method of claim 2, wherein the objects are time slices and the attributes are the 
status of elements in a system. 

10. The method of claim 2, wherein the objects are time slices and the attributes are
prices, or price changes of, financial instruments or commodities.

11. The method of claim 2, wherein the steps of the method are represented by the
following pseudo-code:

 0. begin
 1.   read (MATRIX);
 2.   read (R, T);
 3.   compute_first_order_marginals(MATRIX);
 4.   csets := {};
 5.   for iter = 1 to T do
 6.     sampled_rows := rsample(R, MATRIX);
 7.     attributes := get_attributes(sampled_rows);
 8.     all_coincidences := find_all_coincidences(attributes);
 9.     for coincidence in all_coincidences do
10.       if cset_already_exists(coincidence, csets)
11.         then update_cset(coincidence, csets);
12.         else add_new_cset(coincidence, csets);
13.       endif
14.     endfor
15.   endfor
16.   for cset in csets do
17.     expected := compute_expected_match_count(cset);
18.     observed := get_observed_match_count(cset);
19.     stats := update_stats(cset, hypoth_test(expected, observed));
20.   endfor
21.   print_final_stats(csets, stats);
22. end
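
The pseudo-code can be rendered as a short Python sketch. The claim leaves `find_all_coincidences`, `compute_expected_match_count` and `hypoth_test` undefined, so the choices below are assumptions made for illustration only: each object is a set of attribute labels, a coincidence is the set of at least two attributes shared by all R sampled rows (one maximal set per iteration), the expected count assumes independent attributes, and the observed-to-expected ratio stands in for the hypothesis test. The function name `detect_coincidences` is likewise hypothetical.

```python
import random
from collections import Counter
from math import prod


def detect_coincidences(matrix, r, iterations, seed=0):
    """matrix: list of sets, one set of attribute labels per object (row).
    Returns {coincidence: (observed, expected, observed / expected)}."""
    rng = random.Random(seed)
    n_rows = len(matrix)
    # Step 3: first-order marginals, the fraction of rows with each attribute.
    attribute_counts = Counter(a for row in matrix for a in row)
    marginals = {a: c / n_rows for a, c in attribute_counts.items()}
    # Steps 4-15: sample R rows T times and count the shared attribute sets.
    csets = Counter()
    for _ in range(iterations):
        sampled_rows = rng.sample(matrix, r)
        shared = set(sampled_rows[0]).intersection(*sampled_rows[1:])
        if len(shared) >= 2:
            csets[frozenset(shared)] += 1
    # Steps 16-20: compare the observed count with the expected count.
    stats = {}
    for cset, observed in csets.items():
        # Under independence, all r rows carry attribute a with probability p_a ** r.
        expected = iterations * prod(marginals[a] ** r for a in cset)
        stats[cset] = (observed, expected, observed / expected)
    return stats
```

Coincidences whose ratio exceeds a chosen threshold would then be reported as correlated k-tuples, which corresponds to step 21.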

12. A coincidence detection system for use with a data set of objects, each object having 
a plurality of attributes, the system comprising: 

means for sampling a subset of the data set for a predetermined number of 
iterations, each iteration the sampled subset of the data set having for each 
object the same subset of attributes; 

means for detecting, and recording counts of, coincidences in each sampled 
subset of the data set, a coincidence being the co-occurrence of a plurality of 
attribute values in one or more objects in a sampled subset of the data set, 
where the plurality of attribute values is the same for each occurrence, the 
detecting and recording counts of coincidences in each sampled subset being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

means for determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

means for comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

means for reporting a set of k-tuples of correlated attributes, where a k-tuple 
of correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

13. The coincidence detection system of claim 12, wherein the means of the system in
the aggregate carry out a method represented by the following pseudo-code:

 0. begin
 1.   read (MATRIX);
 2.   read (R, T);
 3.   compute_first_order_marginals(MATRIX);
 4.   csets := {};
 5.   for iter = 1 to T do
 6.     sampled_rows := rsample(R, MATRIX);
 7.     attributes := get_attributes(sampled_rows);
 8.     all_coincidences := find_all_coincidences(attributes);
 9.     for coincidence in all_coincidences do
10.       if cset_already_exists(coincidence, csets)
11.         then update_cset(coincidence, csets);
12.         else add_new_cset(coincidence, csets);
13.       endif
14.     endfor
15.   endfor
16.   for cset in csets do
17.     expected := compute_expected_match_count(cset);
18.     observed := get_observed_match_count(cset);
19.     stats := update_stats(cset, hypoth_test(expected, observed));
20.   endfor
21.   print_final_stats(csets, stats);
22. end



14. The coincidence detection system of claim 12, wherein the means for sampling a 
subset of the data set comprises means for dividing the data set into subsets for 
sampling. 

15. The coincidence detection system of claim 14, wherein the means for detecting and
recording counts of coincidences comprises an array of processing nodes, each 
processing node detecting and recording a respective subcount of coincidences, and 
wherein the means for comparing, for each coincidence of interest, said observed 
count of coincidences to said expected count of coincidences comprises means for 
merging said subcounts to provide said observed count. 
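
The merging of subcounts can be sketched in Python as a key-by-key sum. In this example each processing node is assumed to report a mapping from coincidence to subcount; that layout and the function name `merge_subcounts` are hypothetical and not taken from the claims.

```python
from collections import Counter


def merge_subcounts(subcounts):
    """Sum per-node coincidence subcounts into one observed count per coincidence."""
    merged = Counter()
    for node_counts in subcounts:
        # Given a mapping, Counter.update adds its counts key by key.
        merged.update(node_counts)
    return merged
```

Because the merge is a plain sum, it can be applied in stages: the result of merging one subarray of nodes can itself be passed in as one of the subcounts at the next level.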

16. The coincidence detection system of claim 15, wherein at least one of said processing 
nodes comprises a respective subarray of processing nodes that detect and record 
respective subsubcounts of coincidences, and wherein said means for merging 
merges said subsubcounts to provide said subcounts and/or said observed count. 

17. The coincidence detection system of claim 15 or 16, wherein each processing node 
comprises memory including an input buffer for storing received subsets of the data 
set and an output buffer for storing the subcount or the subsubcount; and a memory 
bus that transfers data to and from the memory. 

18. Coincidence detection programmed media for use with a computer and with a data 
set of objects having a number of attributes represented in a matrix of objects versus 
attributes, the programmed media comprising: 

a computer program stored on storage media compatible with the computer, 
the computer program containing instructions to direct the computer to: 
sample a subset of the data set for a predetermined number of 
iterations, each iteration the sampled subset of the data set having for 
each object the same subset of attributes; 

detect and record counts of coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of 
attribute values in one or more objects in a sampled subset of the data 
set, where the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coincidences in 
each sampled subset being performed before, at the same time or after 
sampling, detecting and recording counts of coincidences in other
subsets; 

determine an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after 
sampling, detecting and recording; 

compare, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determine a measure of correlation for the plurality of 
attributes for the coincidence; and 

report a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure 
of correlation is above a respective pre-determined threshold. 



19. Coincidence detection system for use with a data set of objects having a number of 
attributes, the system comprising: 
a computer; and 

a computer program on media compatible with the computer, the computer program 

directing the computer to: 

sample a subset of the data set for a predetermined number of iterations, each
iteration the sampled subset having for each object the same subset of 
attributes, 

detect, and record counts of, coincidences in each sampled subset of the data 
set, a coincidence being the co-occurrence of a plurality of attribute values in 
one or more objects in a sampled subset of the data set, where the plurality of 
attribute values is the same for each occurrence, the detecting and recording 
counts of coincidences in each sampled subset being performed before, at the 
same time or after sampling, detecting and recording counts of coincidences 
in other subsets; 

determine an expected count for each coincidence of interest, the determining 
being performed before, at the same time, or after sampling, detecting and 
recording, 

compare, for each coincidence of interest, the observed count of coincidences
versus the expected count of coincidences, and from this comparison 
determine a measure of correlation for the plurality of attributes for the 
coincidence, and 

report a set of k-tuples of correlated attributes, where a k-tuple of correlated 
attributes is a plurality of attributes for which the measure of correlation is 
above a respective pre-determined threshold. 

20. The coincidence method of claim 2, further comprising the step of representing the
objects and attributes in a matrix of objects versus attributes prior to sampling the 
data set, the data set being sampled by sampling the matrix. 
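
A matrix of objects versus attributes, and sampling by rows of that matrix, might look like the following Python sketch. The 0/1 encoding, the input layout (a dict from object identifier to attribute labels) and the names `build_matrix` and `sample_rows` are assumptions made for this example.

```python
import random


def build_matrix(objects):
    """objects: dict mapping object id -> iterable of attribute labels.
    Returns (row_ids, attribute_names, rows), where rows[i][j] is 1 if
    object row_ids[i] possesses attribute attribute_names[j], else 0."""
    attribute_names = sorted({a for attrs in objects.values() for a in attrs})
    row_ids = sorted(objects)
    rows = []
    for object_id in row_ids:
        present = set(objects[object_id])
        rows.append([1 if a in present else 0 for a in attribute_names])
    return row_ids, attribute_names, rows


def sample_rows(rows, r, rng):
    """Sample r distinct rows of the matrix without replacement."""
    return rng.sample(rows, r)
```

For instance, `build_matrix({"seq1": ["A18", "Q31"], "seq2": ["Q31", "H33"]})` yields the columns A18, H33, Q31 and the rows [1, 0, 1] and [0, 1, 1].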

21. A product having a set of attributes selected by:

• sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset having 
for each object the same subset of attributes, 

• detecting, and recording counts of, coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets, 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording, 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and 

reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

22. A product defined by applying a set of rules generated from: 

sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset having 
for each object the same subset of attributes, 

detecting and recording counts of coincidences in each sampled subset of the 
data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets, 
• determining an expected count for each coincidence of interest, the 

determining being performed before, at the same time, or after sampling, 
detecting and recording, 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and 

reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

23. A method comprising: 

the method of claim 2, and 
the further step of: 

applying rules that are defined by the reported correlated attributes. 

24. A peptide or peptidomimetic including a structural motif of the V3 loop of HIV
envelope protein including spatial coordinates of residues A18/Q31/H33. 

25. A pharmaceutical composition comprising a ligand that interacts with a protein 
having a structural motif identified using the method of claim 2, and a 
pharmaceutically acceptable carrier or excipient therefor.

26. The pharmaceutical composition of claim 25, wherein the ligand comprises chemical 
moieties of suitable identity and spatially located relative to each other so that the 
moieties interact with corresponding residues or portions of the motif. 

27. The pharmaceutical composition of claim 26, wherein the ligand, by interacting with 
the motif, interferes with function of a region of the protein comprising the motif.

28. A diagnostic agent comprising a ligand that interacts with a protein having a
structural motif identified using the method of claim 2, and a detectable label linked 
to the ligand. 

29. A pharmaceutical composition for interacting with an envelope protein of human 
immunodeficiency virus (HIV), the envelope protein including a structural motif of 
the V3 loop having spatial coordinates of residues A18/Q31/H33, comprising a ligand 
including at least one functional group that interacts with the motif, and a 
pharmaceutically acceptable carrier or excipient therefor.

30. The pharmaceutical composition of claim 29, wherein the ligand includes at least one 
functional group capable of binding to and being present in an effective position in 
said ligand to bind to residue 18, at least one functional group capable of binding to 
and being present in an effective position in said ligand to bind to residue 31, and at
least one functional group capable of binding to and being present in an effective 
position in said ligand to bind to residue 33. 

31. A method of designing a ligand to interact with a structural motif of an envelope
protein of human immunodeficiency virus (HIV), the method comprising the steps
of: providing a template having spatial coordinates of residues A18, Q31 and H33 in
the V3 loop of HIV envelope protein, and computationally evolving a chemical 
ligand using an effective algorithm with spatial constraints, so that said evolved 
ligand includes at least one effective functional group that binds to the motif. 

32. The method of claim 31, wherein the ligand comprises: at least one functional group
capable of binding to and being present in an effective position in said ligand to bind
to residue 18, at least one functional group capable of binding to and being present in
an effective position in said ligand to bind to residue 31, and at least one functional
group capable of binding to and being present in an effective position in said ligand 
to bind to residue 33. 

33. A method of identifying a ligand to bind with a structural motif of an envelope 
protein of human immunodeficiency virus (HIV), the method comprising the steps 
of: providing a template having spatial coordinates of A18, Q31 and H33 in the V3 
loop of HIV envelope protein; providing a data base containing structure and 
orientation of molecules; and screening said molecules to determine if they contain 
effective moieties spaced relative to each other so that the moieties interact with the 
motif.

34. The method of claim 33, wherein a first moiety of the molecule interacts with residue 
18, a second moiety of the molecule interacts with residue 31 and a third moiety of
the molecule interacts with residue 33. 

35. Antigens and vaccines embodying the covarying k-tuples described herein. 

36. A product being defined by its interaction with a set of attributes selected by: 

sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset of the
data set having for each object the same subset of attributes, 

detecting, and recording counts of, coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset, where the plurality of 
attribute values is the same for each occurrence, the detecting and recording 
counts of coincidences in each sampled subset being performed before, at the 
same time or after sampling, detecting and recording counts of coincidences 
in other subsets, 

• determining an expected count for each coincidence of interest, the 

determining being performed before, at the same time, or after sampling, 
detecting and recording, 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and 

reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a pre-determined threshold. 

37. The method of claim 2, wherein the objects are compounds and the attributes 
comprise particular chemical moieties. 

38. The method of claim 2, wherein the objects are peptides or proteins and the 
attributes comprise particular structural or substructural patterns or motifs. 

39. The method of claim 2, wherein the objects are selected from the group consisting of 
compounds, molecular structures, nucleotide sequences and amino acid sequences 
and the attributes are features of the selected objects. 

40. The method of claim 2, wherein the objects are time slices and the attributes are 
biological parameters of genes or gene products. 

41. The method of claim 2, wherein the objects are documents that are electronically 
stored and/or electronically indexed and the attributes are topics. 

42. The method of claim 2, wherein the objects are customers and the attributes 
comprise products purchased or not purchased by those customers. 

43. The method of claim 42, wherein the attributes further comprise mailings made or 
not made to the customers. 

44. The method of claim 2, wherein the objects comprise products and the attributes 
comprise customers that have or have not purchased those products. 

45. The method of claim 44, wherein the attributes further comprise demographic 
variables of the customers. 

46. The method of claim 2, wherein the objects are people with a particular disease or 
disorder and the attributes are potential contributing factors for the disease or 
disorder. 

47. The method of claim 2, wherein the objects are people with a number of different 
diseases or disorders and the attributes are potential contributing factors for the 
diseases or disorders. 

48. The method of claim 2, wherein the objects comprise factors potentially contributing 
to a disease or disorder and the attributes are people with or without those factors, 
wherein the method associates groups of people of substantially equivalent risk for 
the disease or disorder. 

49. The method of claim 2, wherein the objects are time slices and the attributes 
comprise the state of components in a system at time slices prior to failure of the 
system, wherein the method associates component states that may potentially cause 
failure of the system. 

50. The coincidence detection method of claim 1, where $r_i$ is the same for every iteration. 

51. The method of claim 2, further comprising the steps of first creating a database of 
transitions between system states, wherein a system state is represented by a value of 
a state variable, over a chosen time quantum, and presenting the database, in whole 
or part, as a data set such that each state to state transition set corresponds to one of 
M objects and so that each state variable corresponds to an attribute. 

52. The method of claim 2, further comprising the steps of first creating a database of 
states and actions covering a chosen time quantum and presenting the database, in 
whole or part, as a data set such that each state/action/state triple corresponds to one 
of M objects and so that each state variable or action type corresponds to an 
attribute. 



53. A coincidence detection method for use with a data set of objects having a number of 
attributes represented in a matrix of objects versus attributes, the method comprising 
the steps of: 

• sampling a subset of the matrix for a predetermined number of iterations, 
each iteration the sampled subset of the matrix having for each object the 
same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
the matrix, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the matrix, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

54. The method of claim 1, wherein numerical correlation values are reported along with 
the set of k-tuples of correlated attributes. 

COINCIDENCE DETECTION METHOD, 
PRODUCTS AND APPARATUS 

TECHNICAL FIELD 

The invention relates to methods, devices and systems for coincidence detection 
among a multitude of variables. In addition, the invention relates to applying coincidence 
detection methods to various fields, and to products derived from such application. 



BACKGROUND ART 

k-tuples of Correlated Attributes 

The discovery of correlations among pairs or k-tuples of variables has applications in 
many areas of science, medicine, industry and commerce. For example, it is of great interest 
to physicians and public health professionals to know which lifestyle, dietary, and 
environmental factors correlate with each other and with particular diseases in a database of 
patient histories. It is potentially profitable for a trader in stocks or commodities to discover 
a set of financial instruments whose prices covary over time. Sales staff in a supermarket 
chain or mail-order distributor would be interested in knowing that consumers who buy 
product A also tend to buy products B and C, and this can be discovered in a database of 
sales records. Computational molecular biologists and drug discovery researchers would 
like to infer aspects of 3D molecular structure from correlations between distant sequence 
elements in aligned sets of RNA or protein sequences. 

One formulation of the general problem which encompasses many diverse 
applications, and which facilitates understanding of the principles described herein, is a matrix 
of discrete features in which rows correspond to "objects" (such as individual patients, stock 
prices, consumers, or protein sequences) and the columns correspond to features, or 
attributes, or variables (such as lifestyle factors, stocks, sales items, or amino acid residue 
positions). 

Mathematical methods for determining a measure of the type, degree, and statistical 
significance of correlation between any two, or even three or four, particular variables are 
widespread and well-understood. These methods include linear and nonlinear regression for 
continuous variables and contingency table analysis techniques for discrete variables. 
However, great difficulties arise when one tries to estimate correlation - or just estimate 
joint or conditional probabilities - over much larger sets of variables. This intractability has 
one main cause - there are too many joint attribute-value probability density terms - and this 
manifests itself in two serious problems: (1) computing and storing frequency counts over all 
terms, over the database, requires too much computation and memory; (2) there is usually an 
insufficient number of database records to support reliable probability estimates based on 
those frequency counts. 

Let us consider some details. For $M$ records (objects), $N$ variables (attributes, 
fields), and supposing that each variable has the same set of $|A|$ possible values, 
there are $\binom{N}{k} = \frac{N!}{(N-k)!\,k!}$ k-tuples of columns. Adding the number of k-tuples for each k = 1, 
2, ..., N results in $2^N - 1$ such tuples of all sizes. This exponential complexity has been a 
major obstacle standing in the way of higher-order probability estimation and correlation 
detection methodologies. 
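
The total of $2^N - 1$ follows from the binomial theorem. As a short worked step (a standard identity, written here in the notation used above):

```latex
\sum_{k=1}^{N} \binom{N}{k}
  = \sum_{k=0}^{N} \binom{N}{k} - \binom{N}{0}
  = (1 + 1)^{N} - 1
  = 2^{N} - 1
```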

One natural way to think about this complexity is in terms of the power set of the set 
of column variables. This power set forms a mathematical lattice under the operation $\subset$, a 
"tower" corresponding to a graph whose nodes are subsets of this set of column variables. 
(Note that if a set has $N$ members, the power set has $2^N$ members.) From this viewpoint, 
two nodes representing subsets $\sigma_1$ and $\sigma_2$ are connected if and only if either $\sigma_1 \subset \sigma_2$ or 
$\sigma_2 \subset \sigma_1$. We say that $\sigma_2$'s node is above $\sigma_1$'s if $\sigma_1 \subset \sigma_2$. This gives a natural meaning to the term 
"higher-order", as appearing higher up the tower. We call the bottom, the null set node, the 
0th tier; the single column terms form the first tier, and so on. 



Continuing with the tower analogy, we note that each "floor" of this edifice contains 
$\binom{N}{k}$ "suites", and each suite contains $|A|^k$ "rooms". In other words, the kth level of the 
lattice corresponds to $\binom{N}{k}$ different k-tuples of column variables, and associated with each k-tuple 
is an ($|A|$ by $|A|$ ... by $|A|$) contingency table, each cell of which must store the counted 
frequency of a particular joint symbol $(a_{i_1}, a_{i_2}, \ldots, a_{i_k})$ were one to use a classical 
contingency table test for the correlation between those particular k columns. (See Figure 1). 

For any $k \in \{1, 2, \ldots, N\}$ and any particular k-tuple of columns $(c_{j_1}, c_{j_2}, \ldots, c_{j_k})$, 
there are $|A|^k$ possible joint values, and the estimation of Kullback divergence or other correlation function 
using the dataset is at least an $\Omega(Mk)$ or $\Omega(|A|^k)$ computation, depending upon the relative 
sizes of $M$, $k$ and $|A|$. 

A comprehensive probabilistic model of the database must be able to specify 
probability estimates for $\sum_{k=1}^{N} \binom{N}{k} |A|^k$ terms. This means, for example in the computational 
molecular biology domain, that for a tiny heptapeptide sequence family, each sequence 
having a length of seven amino acid residues, there are 1,801,088,540 terms to specify. For 
an unrealistically small RNA of fifteen nucleotides in length, over the smaller RNA alphabet 
of four base symbols, there are 30,517,578,124 terms. 
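
The two term counts quoted above can be reproduced directly from the sum $\sum_{k=1}^{N} \binom{N}{k} |A|^k$. The following is a minimal sketch: the helper name `model_terms` is hypothetical, and the alphabet size of 20 for amino acids is an assumption (the text states only the four-symbol RNA alphabet).

```python
from math import comb


def model_terms(n_columns: int, alphabet_size: int) -> int:
    """Sum over k = 1..N of C(N, k) * |A|^k (hypothetical helper name)."""
    return sum(comb(n_columns, k) * alphabet_size ** k
               for k in range(1, n_columns + 1))


heptapeptide = model_terms(7, 20)  # assumes the 20 standard amino acids
short_rna = model_terms(15, 4)     # four base symbols, as stated in the text

# By the binomial theorem the sum has the closed form (|A| + 1)^N - 1.
assert heptapeptide == 21 ** 7 - 1
assert short_rna == 5 ** 15 - 1

print(f"{heptapeptide:,}")  # 1,801,088,540
print(f"{short_rna:,}")     # 30,517,578,124
```

The closed form makes the growth rate explicit: each added column multiplies the number of terms by roughly $|A| + 1$.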

Clearly the models can become intractably huge. What about the space of possible 
models through which a modelling/learning procedure must search? Consider a latent- 
variable model, which seeks to explain correlations between sets of observable variables by 
positing latent variables whose states influence the observables jointly. Since each model 
must specify a set of k-tuples of variables, and there are exp(2, $2^N$) (i.e., 2 to the power $2^N$) 
such sets, there are exp(2, $2^N$) possible models in the worst-case search space. 

Various methods for determining a measure of higher-order probabilities will 
circumvent the combinatorial explosion through severe prior restrictions on the width k (See 
Figure 3), the locality (Figure 2), the number, or the degrees of correlation of the higher-order 
features sought, and on the kinds of models entertained (See Figure 4). 

Three Goals of Probability Estimation 

It is useful, before discussing details of existing methods and of the current invention, 
to delineate three different possible goals of probability estimation in large datasets, each 
corresponding to a large body of research and current practice: 

1. Estimation of the fully-specified, fully higher-order joint probability 
distribution: Estimate a probability density q that specifies 

$q(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k})$ 

for all k-tuples of attributes and possible values. 

2. Hypothesis testing, for particular hypotheses concerning particular attributes 
and particular variables: For example, are the data consistent with the hypothesis 
that columns $c_{i_1}, c_{i_2}, \ldots, c_{i_k}$ are independent? 

3. Feature detection, or "data mining": Detect the most suspicious coincidences, 
for example, joint attribute occurrences that are more probable than would be 
predicted from lower-order marginals. Related to this, find the most highly 
correlated k-tuples of columns. 

It is the feature detection and data mining applications that are most relevant to the 
present invention. However, some of the most successful ways to estimate a full higher-order 
joint probability distribution of a database require the specification of exactly those 
higher-order terms which represent high correlations among sets of $k \geq 2$ variables and 
invoking maximum entropy assumptions, and therefore the current invention is aimed at 
those applications as well. 

Related Work 

Various mathematical and computational methods have been proposed and used to 
estimate higher-order probabilities, to detect correlations, and to model higher-order 
database relationships. All such prior methods either perform a global, sometimes 
exhaustive search through all possible k-tuples of variables, which is too costly, or they 
avoid the complexity altogether by limiting their search to only k-tuples of a specific fixed, 
small size k. (Often, k = 2 so only pairwise correlations are ever considered). 

Below are listed some representative examples of related work. 

Assuming Independence between Attributes. The easiest way to avoid the 
complexity of higher-order correlations is just to pretend that they do not exist. Many of the 
algorithms and computer programs, historically dominant in some fields of application of the 
current method, simply construct and use a model of the data in which all variables, all 
attributes, are independent. For example, the modelling of DNA and protein sequences, in 
computational molecular biology, is often done with consensus sequences and profiles, 
which assume incorrectly that the different base or amino acid residue positions are 
independent. Reliance on such models can obscure crucial functional and structural insights 
into the DNA or proteins being modelled. 

Prior Limits on k. One proposal for Gibbs models of databases is based on the 
use of Gibbs potentials, and it proposes a hashing method for calculating these special terms. 
Each kth-order potential requires an estimation of a kth-order joint probability density as 
well as some number of lower-order (typically (k-1)th-order) densities. The asymptotic time 
complexity of Miller's pattern-collection subroutine, the major component of the potential 
calculation, is, when interpreted in our terminology: 

20 M ■ 2 N A 2* « O(MA0 

« (* ) 

where K = k max is the highest order of features for which one will search and by which one 
will represent database objects. This exponential blow-up prevents one from searching for 
higher-order features (HOFs) of any order k much higher than 4 or 5 in databases with 
hundreds of attributes. 

Many methods, in different application areas, simply limit k to k = 2. For example, 
pairwise inter-residue correlation methods discover second-order features that can be useful 
in the prediction of protein structure and function and that can be built into classifiers more 
sensitive than first-order sequence classifiers and fold-recognizers. To the extent that k-ary 
interactions are important, and to the extent that such interactions leave traces in sets of 
homologous sequences, the pairwise methods are deficient. One can try to infer k-ary 
correlations from sets of 2-ary correlations [9] (essentially by computing the transitive 
closure of the "CorrelatesWith" binary relation), but this heuristic can lead to trouble: high 
pairwise correlations among variables x, y, z do not in general imply, nor are they necessarily 
implied by, a high 3-ary correlation (as measured by Kullback divergence) of the three 
variables x, y, z. In other application areas, such as the study of multiple drug interactions, it 
is similarly true that important higher-order relationships can be missed by pairwise 
correlation detection methods. 
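
A small constructed example illustrates one direction of this point: a strong 3-ary dependence that no pairwise test can see. The sketch below assumes two fair, independent bits x and y with z = x XOR y (an illustrative distribution, not data from this document) and measures dependence as the Kullback divergence between a joint distribution and the product of its single-variable marginals. The function names are hypothetical.

```python
from itertools import product
from math import log2

# Joint distribution of (x, y, z): x, y fair independent bits, z = x XOR y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}


def marginal(dist, idx):
    """Marginal distribution over the variable positions listed in idx."""
    out = {}
    for values, p in dist.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def divergence_from_independence(dist, idx):
    """Kullback divergence (bits) between the joint over idx and the
    product of the corresponding single-variable marginals."""
    sub = marginal(dist, idx)
    singles = [marginal(dist, (i,)) for i in idx]
    total = 0.0
    for values, p in sub.items():
        indep = 1.0
        for single, v in zip(singles, values):
            indep *= single[(v,)]
        total += p * log2(p / indep)
    return total


pairwise = [divergence_from_independence(joint, pair)
            for pair in ((0, 1), (0, 2), (1, 2))]
three_way = divergence_from_independence(joint, (0, 1, 2))
print(pairwise, three_way)  # [0.0, 0.0, 0.0] 1.0
```

Every pair of variables is exactly independent, yet the triple carries one full bit of dependence, so a transitive-closure heuristic over pairwise correlations would report nothing here.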

The Paturi et al. Method for Identifying the Most Correlated Pair of Random 
Variables. A method has been reported for the problem of finding the most highly 
correlated pair $X_i$, $X_j$ of variables from among a large set of $N$ random binary variables $X_1$, 
$X_2$, ..., $X_N$. The method is easily extended to finding the most correlated k-tuple of random 
binary variables, but at a significant increase in computational complexity, and only for $k \geq 2$ 
fixed a priori. It uses a definition of correlation that has Correlation($X_i$, $X_j$) = $P[X_i = X_j]$ 
over some set of $M$ samples $\{X^m_1, X^m_2, \ldots, X^m_N\}$. (Here $P[X_i = X_j]$ means "the 
probability that variable $X_i$ has the same value, or state, as variable $X_j$".) Much of the 
computational complexity, both time complexity and sample complexity, of their method can 
be incurred in trying to separate two or more nearly equally-correlated pairs (or k-tuples) of 
variables. 
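
The correlation definition used here, $P[X_i = X_j]$, is simple to estimate from samples. The sketch below is a brute-force baseline that scores every pair, which is quadratic in the number of variables. It is not the Paturi et al. procedure itself, whose details are not given in this text, and the function names and sample data are illustrative assumptions.

```python
from itertools import combinations


def agreement(samples, i, j):
    """Fraction of samples in which variables i and j take the same value,
    i.e. an empirical estimate of P[X_i = X_j]."""
    return sum(s[i] == s[j] for s in samples) / len(samples)


def most_correlated_pair(samples):
    """Exhaustive search over all pairs of variables."""
    n = len(samples[0])
    return max(combinations(range(n), 2),
               key=lambda pair: agreement(samples, *pair))


# Four samples of four binary variables; variables 0 and 1 always agree.
samples = [
    (0, 0, 1, 0),
    (1, 1, 0, 0),
    (1, 1, 1, 1),
    (0, 0, 0, 1),
]
print(most_correlated_pair(samples))  # (0, 1)
```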



The two variants of the Paturi method are asymptotically quadratic and sub-quadratic 
in N, respectively, the faster procedure requiring more sampling. When the 
method is extended to search for the biggest k-ary correlation, where correlation is now 
defined as $P[X_{i_1} = X_{i_2} = \ldots = X_{i_k}]$, the time complexity grows to approximately 
Of&N^Q^N). Search for highly correlated attribute cliques of width k much greater than 5 
or 6 in very large datasets is once again ruled out. 



Hidden Markov Models. Hidden Markov Models (HMMs) have been used 
widely and with increasing success in recent years, in both automatic speech recognition and 
in the modelling of protein, DNA, and RNA sequences. 


Although some groups have reported significant success in modelling protein 
sequence families and continuous speech data with HMMs, nonetheless there are great 
improvements to be made in learning time and model robustness by the "hardwiring" of pre- 
selected higher-order features into HMMs. (This has been investigated for HMM-like 
recurrent neural networks, in different domains). 

Some of the same reasons why HMMs are very good at aligning the protein 
sequences or recorded utterances in the first place, using local sequential correlations, make 
such methods less useful for finding the important sequence-distant correlations in data that 
has already been partially or completely aligned. The phenomenon responsible for this 
dilemma is termed "diffusion". 

A first-order HMM, by definition, assumes independence among sequence columns, 
given a hidden state sequence. Multiple alternative state sequences can in principle be used 
to capture longer-range interactions, but the number of these grows exponentially with the 
number of k-tuples of correlated columns. 

The Agrawal et al. Method for Discovery of Association Rules. This method 
was developed in perhaps the purest data mining context, the automatic extraction of 
knowledge-base rules from databases. It considers a database of $M$ transactions (objects, 
rows) and $N$ items (attributes, columns) and seeks to extract rules of the form $a \Rightarrow b$. It 
therefore seeks pairs of attributes a, b such that "transactions that contain a tend to contain 
b", hence those pairs with high values for $p(b \mid a)$. "People who buy CD players tend to buy 
CDs" is just one example suggesting the potential commercial interest in such methods. 
(More generally, one can search for sets of attributes with high $p(b_1, b_2, \ldots, b_k \mid a_1, a_2, \ldots)$.) 

A rule $a \Rightarrow b$ is said to have: 

1. confidence c if c% of transactions containing a also contain b (hence, roughly, if 
$\frac{p(a,b)}{p(a)} \geq \frac{c}{100}$); 

2. support s if s% of transactions contain a and b (hence, roughly, if $p(a,b) \geq \frac{s}{100}$). 
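
The two definitions translate directly into frequency counts. The following sketch computes both for a single rule over a toy set of transactions. The function name, the item names and the data are illustrative assumptions, and this computes only the two quantities defined above, not the Agrawal et al. rule-discovery algorithm.

```python
def support_and_confidence(transactions, a, b):
    """Return (support %, confidence %) for the rule a => b.

    transactions: a sequence of sets of items.
    support    = percentage of all transactions containing both a and b.
    confidence = percentage of transactions containing a that also contain b.
    """
    transactions = list(transactions)
    n_a = sum(a in t for t in transactions)
    n_ab = sum(a in t and b in t for t in transactions)
    support = 100.0 * n_ab / len(transactions)
    confidence = 100.0 * n_ab / n_a if n_a else 0.0
    return support, confidence


baskets = [
    {"cd_player", "cd"},
    {"cd_player", "cd", "batteries"},
    {"cd_player"},
    {"cd"},
    {"batteries"},
]
support, confidence = support_and_confidence(baskets, "cd_player", "cd")
print(f"support={support:.1f}% confidence={confidence:.1f}%")
# support=40.0% confidence=66.7%
```

Note that confidence is not symmetric: swapping a and b changes the denominator from the count of a to the count of b.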

The goals behind this method are different from the objectives of the current 
invention. However, the different objectives are brought closer together if one focuses on 
the Agrawal method's discovery of symmetric rules (so that the search is for attribute pairs 
displaying high values for both $\frac{p(a,b)}{p(a)}$ and $\frac{p(a,b)}{p(b)}$), and if one reduces the emphasis on support (so that 
coincidences that are suspicious, even if occurring rarely, are sought). 

The Agrawal method is shown to have $O(\|S\| \cdot MN)$ time complexity, where $\|S\|$ is 
the sum of all values Support(a) for an exponentially large number of k-tuples a of 
attributes, of any size $1 \leq k \leq N$, that reach a particular stage of processing in this procedure. 
Hence the method is $O(2^N)$ in the worst case. A series of empirical tests was performed on 
what the authors considered to be realistic datasets for their domain. The running time of the 
procedure grew only linearly with the number $M$ of transactions, but the number of items, or 
attributes, was held constant at $N = 1000$, and their constructed datasets probably contained 
no correlated k-tuples of width k > 10. An analysis of their algorithm, which is based on an 
incremental build-up of kth-order cliques from (k-1)th-order cliques, makes clear that the 
method takes much more computation to find wide HOFs (large k) than narrower HOFs 
(lower k) of equivalent statistical significance. 

Steeg, Robinson, Deerfield, Lappa - 1993. Some rough, heuristic methods have 
been presented for finding k-tuples of correlated residues (positions) in sets of aligned 
protein sequences. One of the presented methods employed one embodiment of a 
rudimentary version of the representation and coincidence-detection steps of the method 
described herein. 

Alternative methods of, and devices for, finding correlations between attributes, and 
applications for those correlations, are required. 

DISCLOSURE OF THE INVENTION 



In a first aspect the present invention provides a coincidence detection method for use with a 
data set of objects having a number of attributes. The base method includes the following 
steps: 

• representing a set of $M$ objects in terms of a number $N_A$ of variables 
("attributes"), where an attribute is said to occur in an object if the object 
possesses the attribute; 

• sampling a subset of $r_i$ out of the $M$ objects, for each iteration among a 
predetermined number of iterations; 

• detecting and recording coincidences among sets of k of the attributes in each 
sampled subset of objects, a coincidence being the co-occurrence of $1 \leq k \leq N_A$ 
attributes in the same $h_j$ out of $r_i$ objects in the sampled subset, where 0 ≤ 

• determining an expected count of coincidences for any set of k attributes and 
a predetermined number of iterations of sampling and coincidence-counting 
as described above, the determining being performed before sampling and 
collecting, at the same time or after sampling and collecting; 

• comparing, for any set of k attributes and number of iterations of sampling 
and coincidence-counting, the observed count versus the expected count of 
coincidences, and from this comparison determining a measure of correlation 
(or association, or dependence) for the set of k attributes; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a set of k of the $N_A$ attributes which have been 
determined by this process to have a value for a chosen correlation measure 
above a predetermined threshold value. 

In a second aspect the invention provides a coincidence detection method for use with a data 
set of objects having a number of attributes, the method comprising the steps of: 

• sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

In any of its aspects the comparison of observed and expected counts may be calculated 
using a Chernoff bound on tail probabilities, and counts may be recorded by storing a 
running total of the count of each coincidence over all of the sampled subsets. 
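
As an illustration of how a Chernoff bound can turn an observed and an expected count into a tail probability, the sketch below uses the standard multiplicative form for a sum of independent 0/1 trials with mean $\mu$: $P[X \geq (1+\delta)\mu] \leq \left(e^{\delta} / (1+\delta)^{1+\delta}\right)^{\mu}$ for $\delta > 0$. This is an assumption made for illustration. The text here does not say which form of the bound is used, and the function name is hypothetical.

```python
import math


def chernoff_upper_tail(observed: float, expected: float) -> float:
    """Upper bound on P(X >= observed) for a sum X of independent 0/1
    trials with mean `expected`, via the multiplicative Chernoff bound
    (e^d / (1 + d)^(1 + d))^mu, where d = observed / expected - 1."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    if observed <= expected:
        return 1.0  # the bound is only informative above the mean
    delta = observed / expected - 1.0
    # Work in log space to avoid overflow for large counts.
    log_bound = expected * (delta - (1.0 + delta) * math.log1p(delta))
    return math.exp(log_bound)


# A coincidence seen 40 times when about 10 were expected is far more
# surprising than one seen 15 times.
print(chernoff_upper_tail(15, 10))
print(chernoff_upper_tail(40, 10))
```

A small bound means the observed count is unlikely under the independence baseline, so it can serve as the quantity compared against a threshold.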

In a third aspect the invention provides a method for visual exploration of a data set of 
objects having a number of attributes, the method comprising the steps of: 

• sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having the same number of 
objects although not necessarily the same objects and having for each object 
the same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

• reporting a set of k-tuples of correlated attributes to a user through a 
graphical interface, where a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above a respective 
pre-determined threshold. 



In a fourth aspect the invention provides a pre-processing method for use with a data 
modelling unit to capture and report to the data modelling unit higher order interactions of a 
data set of objects having a number of attributes, the method comprising the steps of: 

• sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset, where the plurality of 
attribute values is the same for each occurrence, the detecting and recording 
counts of coincidences in each sampled subset being performed before, at the 
same time or after sampling, detecting and recording counts of coincidences 
in other subsets; 

• determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

• reporting to the data modelling unit a set of k-tuples of correlated attributes, 
where a k-tuple of correlated attributes is a plurality of attributes for which 
the measure of correlation is above a respective pre-determined threshold. 

In a fifth aspect the invention provides a correlation elimination method for use with a data 
set of objects having a number of attributes, the method comprising the steps of: 
• sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

• eliminating a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

In any of the aspects, the objects may be sales transactions, each transaction comprising one 
or more purchased products, and the attributes may be instances of sale of particular 
products or types of products. The objects may be time slices and the attributes may be the 
status of elements in a system. The objects may be time slices and the attributes may be 
prices of, or price changes of, financial instruments or commodities. 

In any of the aspects the steps of the method may be represented by the following pseudo-code: 

0. begin 
1. read(MATRIX); 
2. read(R, T); 
3. compute_first_order_marginals(MATRIX); 
4. csets := {}; 
5. for iter = 1 to T do 
6.     sampled_rows := rsample(R, MATRIX); 
7.     attributes := get_attributes(sampled_rows); 
8.     all_coincidences := find_all_coincidences(attributes); 
9.     for coincidence in all_coincidences do 
10.        if cset_already_exists(coincidence, csets) 
11.            then update_cset(coincidence, csets); 
12.            else add_new_cset(coincidence, csets); 
13.        endif 
14.    endfor 
15. endfor 
16. for cset in csets do 
17.     expected := compute_expected_match_count(cset); 
18.     observed := get_observed_match_count(cset); 
19.     stats := update_stats(cset, hypoth_test(expected, observed)); 
20. endfor 
21. print_final_stats(csets, stats); 
22. end 
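
The pseudo-code above can be made concrete under some simplifying assumptions. The sketch below is one possible reading, not the method as claimed: attributes are binary (present or absent), rows are sampled with replacement, a coincidence is an attribute k-tuple present in every one of the R sampled rows, only a fixed width k is tracked, and the expected count assumes independent attributes. The function name, the `seed` parameter and the toy matrix are all hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from math import prod


def detect_coincidences(matrix, r, T, k=2, seed=0):
    """Return {attribute k-tuple: (observed count, expected count)}.

    matrix: list of rows, each a list of 0/1 attribute indicators.
    r:      rows sampled per iteration (R in the pseudo-code).
    T:      number of sampling iterations.
    """
    rng = random.Random(seed)
    n_rows, n_attrs = len(matrix), len(matrix[0])

    # Step 3: first-order marginals p_j = fraction of rows having attribute j.
    marginals = [sum(row[j] for row in matrix) / n_rows
                 for j in range(n_attrs)]

    # Steps 5-15: sample, find coincidences, keep running totals.
    observed = Counter()
    for _ in range(T):
        rows = [rng.choice(matrix) for _ in range(r)]  # with replacement
        shared = [j for j in range(n_attrs) if all(row[j] for row in rows)]
        for combo in combinations(shared, k):
            observed[combo] += 1

    # Steps 16-20: under independence, a k-tuple coincides in one iteration
    # with probability prod_j p_j^r, so the expected count is T times that.
    results = {}
    for combo, count in observed.items():
        expected = T * prod(marginals[j] ** r for j in combo)
        results[combo] = (count, expected)
    return results


# Toy data: attributes 0 and 1 are identical; 2 and 3 are unrelated to them.
matrix = [[int(i % 2 == 0), int(i % 2 == 0), int(i % 3 == 0), int(i % 5 == 0)]
          for i in range(200)]
for combo, (obs, exp) in sorted(detect_coincidences(matrix, r=2, T=2000).items()):
    print(combo, obs, round(exp, 1), round(obs / exp, 2))
```

In this toy run the pair (0, 1) should show an observed count several times its expected count, while pairs involving the unrelated attributes stay much closer to expectation. A real implementation would replace the plain ratio with a statistical test and would not restrict the search to a single width k.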




In a sixth aspect the invention provides a coincidence detection system for use with a data 
set of objects, each object having a plurality of attributes, the system comprising: 

• means for sampling a subset of the data set for a predetermined number of 
iterations, each iteration the sampled subset of the data set having for each 
object the same subset of attributes; 

• means for detecting, and recording counts of, coincidences in each sampled 
subset of the data set, a coincidence being the co-occurrence of a plurality of 
attribute values in one or more objects in a sampled subset of the data set, 
where the plurality of attribute values is the same for each occurrence, the 
detecting and recording counts of coincidences in each sampled subset being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

• means for determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

• means for comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

• means for reporting a set of k-tuples of correlated attributes, where a k-tuple 
of correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

In the system of the sixth aspect, the means for sampling a subset of the data set may 
comprise means for dividing the data set into subsets for sampling. The means for detecting 
and recording counts of coincidences may comprise an array of processing nodes, each 
processing node detecting and recording a respective subcount of coincidences, and the 
means for comparing, for each coincidence of interest, said observed count of coincidences 
to said expected count of coincidences may comprise means for merging said subcounts to 
provide said observed count. At least one of said processing nodes may comprise a 
respective subarray of processing nodes that detect and record respective subsubcounts of 
coincidences, and said means for merging merges said subsubcounts to provide said 
subcounts and/or said observed count. Each processing node may comprise memory 
including an input buffer for storing received subsets of the data set and an output buffer for 
storing the subcount or the subsubcount; and a memory bus that transfers data to and from 
the memory. 

In a seventh aspect the invention provides coincidence detection programmed media for use 
with a computer and with a data set of objects having a number of attributes, the 
programmed media comprising: 

a computer program stored on storage media compatible with the computer, 
the computer program containing instructions to direct the computer to: 

• sample a subset of the data set for a predetermined number of 
  iterations, each iteration the sampled subset of the data set having for 
  each object the same subset of attributes; 

• detect and record counts of coincidences in each sampled subset of 
  the data set, a coincidence being the co-occurrence of a plurality of 
  attribute values in one or more objects in a sampled subset of the data 
  set, where the plurality of attribute values is the same for each 
  occurrence, the detecting and recording counts of coincidences in 
  each sampled subset being performed before, at the same time or after 
  sampling, detecting and recording counts of coincidences in other 
  subsets; 

• determine an expected count for each coincidence of interest, the 
  determining being performed before, at the same time, or after 
  sampling, detecting and recording; 

• compare, for each coincidence of interest, the observed count of 
  coincidences versus the expected count of coincidences, and from this 
  comparison determine a measure of correlation for the plurality of 
  attributes for the coincidence; and 

• report a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a plurality of attributes for which the measure 
  of correlation is above a respective pre-determined threshold. 

In an eighth aspect the invention provides a coincidence detection system for use with a data 
set of objects having a number of attributes, the system comprising: 

a computer; and 

a computer program on media compatible with the computer, the computer program 
directing the computer to: 

• sample a subset of the data set for a predetermined number of iterations, each 
  iteration the sampled subset having for each object the same subset of 
  attributes, 

• detect, and record counts of, coincidences in each sampled subset of the data 
  set, a coincidence being the co-occurrence of a plurality of attribute values in 
  one or more objects in a sampled subset of the data set, where the plurality of 
  attribute values is the same for each occurrence, the detecting and recording 
  counts of coincidences in each sampled subset being performed before, at the 
  same time or after sampling, detecting and recording counts of coincidences 
  in other subsets, 

• determine an expected count for each coincidence of interest, the determining 
  being performed before, at the same time, or after sampling, detecting and 
  recording, 

• compare, for each coincidence of interest, the observed count of coincidences 
  versus the expected count of coincidences, and from this comparison 
  determine a measure of correlation for the plurality of attributes for the 
  coincidence, and 

• report a set of k-tuples of correlated attributes, where a k-tuple of correlated 
  attributes is a plurality of attributes for which the measure of correlation is 
  above a respective pre-determined threshold. 

In any of its aspects the methods of the invention may further comprise the step of 
representing the objects and attributes in a matrix of objects versus attributes prior to 
sampling the data set, the data set being sampled by sampling the matrix. 

In a ninth aspect the invention provides a product having a set of attributes selected by: 

• sampling a subset of a data set representing objects versus attributes for a 
  predetermined number of iterations, each iteration the sampled subset having 
  the same number of objects although not necessarily the same objects and 
  having for each object the same subset of attributes, 

• detecting, and recording counts of, coincidences in each sampled subset of 
  the data set, a coincidence being the co-occurrence of a plurality of attribute 
  values in one or more objects in a sampled subset of the data set, where the 
  plurality of attribute values is the same for each occurrence, the detecting and 
  recording counts of coincidences in each sampled subset being performed 
  before, at the same time or after sampling, detecting and recording counts of 
  coincidences in other subsets, 

• determining an expected count for each coincidence of interest, the 
  determining being performed before, at the same time, or after sampling, 
  detecting and recording, 

• comparing, for each coincidence of interest, the observed count of 
  coincidences versus the expected count of coincidences, and from this 
  comparison determining a measure of correlation for the plurality of 
  attributes for the coincidence, and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a plurality of attributes for which the measure of 
  correlation is above a respective pre-determined threshold. 

In a tenth aspect the invention provides a product defined by applying a set of rules 
generated from: 

• sampling a subset of a data set representing objects versus attributes for a 
  predetermined number of iterations, each iteration the sampled subset having 
  for each object the same subset of attributes, 

• detecting and recording counts of coincidences in each sampled subset of the 
  data set, a coincidence being the co-occurrence of a plurality of attribute 
  values in one or more objects in a sampled subset of the data set, where the 
  plurality of attribute values is the same for each occurrence, the detecting and 
  recording counts of coincidences in each sampled subset being performed 
  before, at the same time or after sampling, detecting and recording counts of 
  coincidences in other subsets, 

• determining an expected count for each coincidence of interest, the 
  determining being performed before, at the same time, or after sampling, 
  detecting and recording, 

• comparing, for each coincidence of interest, the observed count of 
  coincidences versus the expected count of coincidences, and from this 
  comparison determining a measure of correlation for the plurality of 
  attributes for the coincidence, and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a plurality of attributes for which the measure of 
  correlation is above a respective pre-determined threshold. 

In any aspect the methods of the invention may further comprise the step of applying rules 
that are defined by the reported correlated attributes. 

In an eleventh aspect the invention provides a peptide or peptidomimetic including a 
structural motif of the V3 loop of HIV envelope protein including spatial coordinates of 
residues A18/Q31/H33. 

In a twelfth aspect the invention provides a pharmaceutical composition comprising a ligand 
that interacts with a protein having a structural motif identified using the method of claim 2, 
and a pharmaceutically acceptable carrier or excipient therefor. The ligand may comprise 
chemical moieties of suitable identity and spatially located relative to each other so that the 
moieties interact with corresponding residues or portions of the motif. The ligand, by 
interacting with the motif, may interfere with function of a region of the protein comprising 
the motif. 

In a thirteenth aspect the invention provides a diagnostic agent comprising a ligand that 
interacts with a protein having a structural motif identified using the method of the earlier 
aspects of the invention, and a detectable label linked to the ligand. 

In a fourteenth aspect the invention provides a pharmaceutical composition for interacting 
with an envelope protein of human immunodeficiency virus (HIV), the envelope protein 
including a structural motif of the V3 loop having spatial coordinates of residues 
A18/Q31/H33, comprising a ligand including at least one functional group that interacts with 
the motif, and a pharmaceutically acceptable carrier or excipient therefor. The ligand may 
include at least one functional group capable of binding to and being present in an effective 
position in said ligand to bind to residue 18, at least one functional group capable of binding 
to and being present in an effective position in said ligand to bind to residue 31, and at least 
one functional group capable of binding to and being present in an effective position in said 
ligand to bind to residue 33. 

In a fifteenth aspect the invention provides a method of designing a ligand to interact with a 
structural motif of an envelope protein of human immunodeficiency virus (HIV), the method 
comprising the steps of: providing a template having spatial coordinates of residues A18, 
Q31 and H33 in the V3 loop of HIV envelope protein, and computationally evolving a 
chemical ligand using an effective algorithm with spatial constraints, so that said evolved 
ligand includes at least one effective functional group that binds to the motif. The ligand 
may comprise at least one functional group capable of binding to and being present in an 
effective position in said ligand to bind to residue 18, at least one functional group capable 
of binding to and being present in an effective position in said ligand to bind to residue 31, 
and at least one functional group capable of binding to and being present in an effective 
position in said ligand to bind to residue 33. 

In a sixteenth aspect the invention provides a method of identifying a ligand to bind with a 
structural motif of an envelope protein of human immunodeficiency virus (HIV), the method 
comprising the steps of: providing a template having spatial coordinates of A18, Q31 and 
H33 in the V3 loop of HIV envelope protein; providing a database containing structure and 
orientation of molecules; and screening said molecules to determine if they contain effective 
moieties spaced relative to each other so that the moieties interact with the motif. A first 
moiety of the molecule may interact with residue 18, a second moiety of the molecule 
may interact with residue 31, and a third moiety of the molecule may interact with residue 33. 

In a seventeenth aspect the invention may provide antigens and vaccines 
embodying the covarying k-tuples described herein. 

In an eighteenth aspect the invention provides a product being defined by its interaction with 
a set of attributes selected by: 

• sampling a subset of a data set representing objects versus attributes for a 
  predetermined number of iterations, each iteration the sampled subset of the 
  data set having the same number of objects although not necessarily the same 
  objects and having for each object the same subset of attributes, 

• detecting, and recording counts of, coincidences in each sampled subset of 
  the data set, a coincidence being the co-occurrence of a plurality of attribute 
  values in one or more objects in a sampled subset, where the plurality of 
  attribute values is the same for each occurrence, the detecting and recording 
  counts of coincidences in each sampled subset being performed before, at the 
  same time or after sampling, detecting and recording counts of coincidences 
  in other subsets, 

• determining an expected count for each coincidence of interest, the 
  determining being performed before, at the same time, or after sampling, 
  detecting and recording, 

• comparing, for each coincidence of interest, the observed count of 
  coincidences versus the expected count of coincidences, and from this 
  comparison determining a measure of correlation for the plurality of 
  attributes for the coincidence, and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a plurality of attributes for which the measure of 
  correlation is above a pre-determined threshold. 

In any of the aspects the objects may be compounds and the attributes may comprise 
particular chemical moieties. The objects may be peptides or proteins and the attributes may 
comprise particular structural or substructural patterns or motifs. The objects may be 
selected from the group consisting of compounds, molecular structures, nucleotide 
sequences and amino acid sequences and the attributes may be features of the selected 
objects. The objects may be time slices and the attributes may be biological parameters of 
genes or gene products. The objects may be documents that are electronically stored and/or 
electronically indexed and the attributes may be topics. The objects may be customers and 
the attributes may comprise products purchased or not purchased by those customers. The 
attributes may further comprise mailings made or not made to the customers. The objects 
may comprise products and the attributes may comprise customers that have or have not 
purchased those products. The attributes may further comprise demographic variables of the 
customers. The objects may be people with a particular disease or disorder and the 
attributes may be potential contributing factors for the disease or disorder. The objects may 
be people with a number of different diseases or disorders and the attributes may be potential 
contributing factors for the diseases or disorders. The objects may comprise factors 
potentially contributing to a disease or disorder and the attributes may be people with or 
without those factors, in which case the method associates groups of people of substantially 
equivalent risk for the disease or disorder. 

The objects may be time slices and the attributes may comprise the state of components in a 
system at time slices prior to failure of the system, in which case the method associates 
component states that may potentially cause failure of the system. 

In the first aspect $r_i$ may be the same for every iteration. 

In any of the aspects the method provided may further comprise the steps of first creating a 
database of transitions between system states, wherein a system state is represented by a 
value of a state variable, over a chosen time quantum, and presenting the database, in whole 
or part, as a data set such that each state to state transition set corresponds to one of M 
objects and so that each state variable corresponds to an attribute. 

In any of its aspects the method provided may further comprise the steps of first creating a 
database of states and actions covering a chosen time quantum and presenting the database, 
in whole or part, as a data set such that each state/action/state triple corresponds to one of 
M objects and so that each state variable or action type corresponds to an attribute. 

In a nineteenth aspect the invention provides a coincidence detection method for use with a 
data set of objects having a number of attributes represented in a matrix of objects versus 
attributes, the method comprising the steps of: 

• sampling a subset of the matrix for a predetermined number of iterations, 
  each iteration the sampled subset of the matrix having for each object the 
  same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
  the matrix, a coincidence being the co-occurrence of a plurality of attribute 
  values in one or more objects in a sampled subset of the matrix, where the 
  plurality of attribute values is the same for each occurrence, the detecting and 
  recording counts of coincidences in each sampled subset being performed 
  before, at the same time or after sampling, detecting and recording counts of 
  coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the 
  determining being performed before, at the same time, or after sampling, 
  detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
  coincidences versus the expected count of coincidences, and from this 
  comparison determining a measure of correlation for the plurality of 
  attributes for the coincidence; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a plurality of attributes for which the measure of 
  correlation is above a respective pre-determined threshold. 

In the first aspect numerical correlation values may be reported along with the set of k-tuples 
of correlated attributes. 

BRIEF DESCRIPTION OF DRAWINGS 

For a better understanding of the present invention and to show more clearly how it 
may be carried into effect, reference will now be made, by way of example, to the 
accompanying drawings which show the preferred embodiment of the present invention and 
in which: 

Figure 1 is a depiction of a power set of a set with N=6 objects, arranged as a lattice 
under a subset operation, representing all possible k-tuples of columns from the power set. 

Figure 1a is a depiction of the relative portions of all lattice nodes shown (dark 
squares) or omitted (light squares) by Figure 1. 

Figure 2 is a depiction of n-grams for all sizes n = 1, 2, ..., 6 for the power set of 
Figure 1. 

Figure 2a is a depiction of the relative portion of all lattice nodes shown or omitted 
in Figure 2 with a subset of the terms highlighted. 

Figure 3 is a depiction of all possible pairwise correlations for the power set of 
Figure 1, corresponding to analysis of the third tier up from the bottom of the lattice. This is 
a shortcut taken in work on inter-residue correlations in protein and RNA sequence families, 
for example. In another example, this Figure represents the approach taken by a method 
that simply finds all pairs of sales items that tend to be purchased together by consumers. 

Figure 3a illustrates the relevant correlations from Figure 3 out of the power set of 
Figure 1. 

Figure 4 is a depiction of a partition of the variables of the objects of the power set 
of Figure 1. A partition is one particular and important kind of componential model of a 
sequence family or other aligned dataset. In a componential model, a set of $N_Y$ latent 
variables $y_i$ is found to "generate" or "explain" a larger set of N observable variables $c_j$. In a 
partition model, $N_Y \le N$, each $c_j$ is generated by exactly one of the $y_i$, and typically $N_Y < N$. 
The observables corresponding to one latent variable form a kind of clique, and presumably 
are highly correlated with each other and relatively uncorrelated with variables outside the 
clique. In Figure 4, the observables are formed into three cliques: $(c_1)$, $(c_2, c_5, c_6)$, and $(c_3, \ldots)$. 

Figure 4a illustrates the partition of Figure 4 out of the power set of Figure 1. 

Figure 5 is a depiction of three iterations of sampling of a dataset in accordance with 
one embodiment of the invention. 

Figure 5a is a depiction of the three iterations of sampling of Figure 5 with 
explanatory notes. 

Figure 6 is a general flow diagram of a program method of a preferred embodiment. 

Figure 7 is a schematic diagram of a system implementing the program method of 
Figure 6. 

Figure 8 is a general flow diagram of the program method of Figure 6 adapted to 
control a process for production of a product. 

Figure 9 is a schematic diagram of a system implementing the adapted program 
method of Figure 8. 

Figure 10 is a general flow diagram of the program method of Figure 6 adapted to 
generate rules for a rules based system that in turn produces a product. 

Figure 11 is a schematic diagram of a system implementing the adapted program 
method of Figure 10. 

Figure 12 is a general flow diagram of the program method of Figure 6 adapted to 
generate rules used to control a process for production of a product. 

Figure 13 is a schematic diagram of a system implementing the adapted program 
method of Figure 12. 

Figure 14 is a diagram of a node of a hardware implementation of a preferred 
embodiment. 

Figure 15 is a diagram of residues for given sequences for the sample 3D structure of 
Figure 15a where coincidence of sequences may indicate conserved physical or structural 
relationships. 

Figure 15a is a diagram of a 3D structure for a sample protein. 

Figure 16 is a diagram of steps in tertiary structure prediction which can employ the 
methods described herein. 

MODES FOR CARRYING OUT THE INVENTION 

As previously set out, a base method described herein employs the steps of: 

• representing a set of M objects in terms of a number $N_A$ of variables 
  ("attributes"), where an attribute is said to occur in an object if the object 
  possesses the attribute; 

• sampling a subset of $r_i$ out of the M objects, for each iteration among a 
  predetermined number of iterations; 

• detecting and recording coincidences among sets of k of the attributes in each 
  sampled subset of objects, a coincidence being the co-occurrence of 
  $1 \le k \le N_A$ attributes in the same $h_j$ out of $r_i$ objects in the sampled 
  subset, where $0 \le h_j \le r_i$; 

• determining an expected count of coincidences for any set of k attributes and 
  a predetermined number of iterations of sampling and coincidence-counting 
  as described above, the determining being performed before sampling and 
  collecting, at the same time or after sampling and collecting; 

• comparing, for any set of k attributes and number of iterations of sampling 
  and coincidence-counting, the observed count versus the expected count of 
  coincidences, and from this comparison determining a measure of correlation 
  (or association, or dependence) for the set of k attributes; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a set of k of the $N_A$ attributes which have been 
  determined by this process to have a value for a chosen correlation measure 
  above a predetermined threshold value. 
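
As a hedged illustration of the expected-count step, the sketch below computes the number of coincidences expected if the k attributes were statistically independent. The model is an assumption made for this example only: rows within a sample are treated as independent draws, each attribute occurs in a row with its first-order marginal probability, and a coincidence means that all k attributes occur in exactly the same h of the r sampled rows. The function name and parameters are hypothetical.

```python
from math import comb, prod

def expected_coincidences(marginals, r, h, t):
    """Expected number of samples, out of t, in which every attribute
    occurs in exactly the same h of the r sampled rows.

    marginals: first-order probability of each of the k attributes.
    Assumes independent attributes and independently drawn rows.
    """
    # One attribute matches a fixed pattern of h rows with probability
    # p**h * (1 - p)**(r - h); there are comb(r, h) such patterns.
    per_sample = comb(r, h) * prod(p**h * (1 - p)**(r - h) for p in marginals)
    return t * per_sample
```

For two attributes that each occur in half of the objects, with r = 2 and h = 2, the per-sample probability is 0.25 × 0.25, so 16 iterations would be expected to yield one coincidence.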

The base method can include the following steps: 

• sampling a subset of the data set for a predetermined number of iterations, 
  each iteration the sampled subset of the data set having for each object the 
  same subset of attributes; 

• detecting, and recording counts of, coincidences in each sampled subset of 
  the data set, a coincidence being the co-occurrence of a plurality of attribute 
  values in one or more objects in a sampled subset of the data set, where the 
  plurality of attribute values is the same for each occurrence, the detecting and 
  recording counts of coincidences in each sampled subset of the data set being 
  performed before, at the same time or after sampling, detecting and recording 
  counts of coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the 
  determining being performed before, at the same time, or after sampling, 
  detecting and recording; 

• comparing, for each coincidence of interest, the observed count of 
  coincidences versus the expected count of coincidences, and from this 
  comparison determining a measure of correlation for the plurality of 
  attributes for the coincidence; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
  correlated attributes is a plurality of attributes for which the measure of 
  correlation is above a respective pre-determined threshold. 

The modes described herein provide extensions to the base methods described above 
and employ similar principles. The principles of one application as described herein may be 
applied to the others as appropriate. Thus, the description of all elements of an application 
will not always be repeated for each application. 

In the preferred embodiment, for simplicity of programming and interpretation, 
a matrix is used where the objects are rows and the attributes are columns; 
however, this is not strictly required and any of the embodiments can utilize a data set of 
objects and attributes that are not represented in the form of a matrix by sampling subsets of 
the data set directly. As known to persons skilled in the art, any relational database can be 
easily transformed into a 2-dimensional matrix format. 

The embodiments described herein lend themselves particularly well to parallel 
processing as the steps of detecting, recording and counting coincidences for each of the r 
samples can be performed simultaneously across many different samples or other subsets of 
the data set. 
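
The following sketch illustrates, under stated assumptions, how the counting work might be divided among workers and the resulting subcounts merged. It uses Python threads purely for portability; a process pool or an array of hardware nodes would be the more natural analogue. The helper names, the with-replacement sampling, and the rule that a coincidence is the set of attributes shared by all r sampled rows are assumptions of this example.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_coincidences(matrix, r, iterations, seed):
    """One worker: sample r rows per iteration, count shared attribute sets."""
    rng = random.Random(seed)
    subcount = Counter()
    for _ in range(iterations):
        rows = [rng.choice(matrix) for _ in range(r)]
        shared = frozenset((j, v) for j, v in enumerate(rows[0])
                           if all(row[j] == v for row in rows))
        if len(shared) >= 2:
            subcount[shared] += 1
    return subcount

def parallel_counts(matrix, r, t, workers=4):
    """Split t iterations across workers and merge their subcounts."""
    shares = [t // workers + (1 if w < t % workers else 0)
              for w in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        subcounts = pool.map(count_coincidences, [matrix] * workers,
                             [r] * workers, shares, range(workers))
        for sub in subcounts:
            total.update(sub)  # Counter.update adds counts together
    return total
```

Because each worker is seeded separately, the merged result is the same as running the workers one after another, which makes the parallel version easy to check.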

Each of the features or variables describing an object may be numerical or 
qualitative. If qualitative, a feature or variable described in terms of some number z of levels 
or qualities may be transformed into a numerical variable with z possible values or states. A 
numerical variable with z possible values or states may be transformed into z binary 
variables, termed attributes. A numerical variable or feature with a continuous range of 
possible values or levels may be transformed into, or represented by, a variable with z 
possible values or states and therefore may also be transformed into, or represented by a set 
of z binary attributes. 
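
A minimal sketch of this transformation follows. Equal-width binning is only one of several possible quantization schemes and is an assumption of this example, as are the function names.

```python
def quantize(values, z):
    """Map continuous values to integer levels 0..z-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / z or 1.0   # guard against a constant variable
    return [min(int((v - lo) / width), z - 1) for v in values]

def to_binary_attributes(levels, z):
    """Expand each level into z binary attributes, exactly one of which is 1."""
    return [[1 if level == s else 0 for s in range(z)] for level in levels]
```

For instance, a variable quantized into z = 2 levels yields two binary attributes per object, one for "low" and one for "high".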

More formally, assume that we are given a database of M objects $O_1, O_2, \ldots, O_M$, each 
of which is characterized by particular values $a_{ij} \in A_j$ for each of N discrete-valued variables 
$v_j$. A particular value for a particular variable is denoted $a_i@v_j$. One may start with 
continuously-valued variables and use any of several known methods to quantize them into 
discrete variables. We also note that, in many applications, the same alphabet A of possible 
values is used for all the variables. Each object might be a particular record in a database, or 
may be a sample from a random source. 

If the initial N variables are not binary then they can be converted into a set of $N_A$ 
attributes. For example, in the input listing attached in Appendix "B" each amino acid 
position is a variable that has 20 possibilities corresponding to the 20 naturally occurring 
amino acids represented by a subset of letters from the alphabet. In order to turn the 
variables into binary attributes, each variable becomes 20 different attributes having 1 of 2 
states, such as "A" or "not A", "B" or "not B", and so on. An embodiment for representing 
variables of this type is included in the source code listing in Appendix "A". Other 
techniques for representing data as attributes could be used. 
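
The conversion can be sketched as follows. This is not the Appendix "A" listing; it is a hypothetical illustration that assumes aligned sequences written in the standard one-letter amino acid codes, and it labels attributes in the value@position style used in this description.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_to_attributes(seq):
    """Return the set of attributes 'X@j' present in one aligned sequence."""
    return {f"{aa}@{pos}" for pos, aa in enumerate(seq, start=1)}

def sequence_to_binary_row(seq):
    """Expand each position into 20 binary attributes, in AMINO_ACIDS order."""
    return [1 if aa == letter else 0 for aa in seq for letter in AMINO_ACIDS]
```

A sequence of length L thus becomes a row of 20 × L binary attributes, of which exactly L are set when every residue is one of the 20 codes.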

The principles set out in this description can also be extended to higher orders of 
attributes, for example trinary attributes to be used with higher order computing machines. 
The binary examples used herein are the simplest to implement. 

This situation can be represented by a table in which each row stands for an object, 
each column stands for an attribute, and in which therefore each table entry $a_{ij}$ stands for the 
fact of the ith object having the value written at $a_{ij}$ for the jth variable. We can also write $c_j$ (for 
"column j") and an attribute as $a_i@c_j$. 

For example, consider this small matrix of six rows (objects) and six columns 
(variables). 

Object number 1 has value 'A' for variable 1, 'B' for variable 2, 'C' for variable 3, 
and so on. For some applications, it might be useful to find out that, for example, variables 
2 and 4 are correlated. In the toy (small fictional) matrix example above, this correlation 
appears plausible, because whenever an object has B@2, it also has D@4; whenever an 
object has L@2, it has M@4; and whenever an object has U@2, it also has V@4. Variable 
number 3 does not vary: every object has the attribute C@3, and therefore it does not 
correlate in an interesting way with any other variable. 
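
Observations of this kind can be checked mechanically. In the sketch below the matrix TOY is a hypothetical stand-in, constructed only to agree with the attribute values quoted in the text (B@2 with D@4, L@2 with M@4, U@2 with V@4, and C@3 in every object); its remaining entries are arbitrary placeholders, and the helper names are assumptions.

```python
# Hypothetical 6 x 6 matrix: rows are objects, columns are variables 1..6.
TOY = [
    "ABCDEF",
    "GLCMNP",
    "QUCVWX",
    "HBCDYZ",
    "ILCMRS",
    "JUCVTK",
]

def rows_with(matrix, value, column):
    """Indices of rows having `value` at the 1-based `column` (value@column)."""
    return {i for i, row in enumerate(matrix) if row[column - 1] == value}

def always_together(matrix, a, b):
    """True if every row with attribute a = (value, column) also has b."""
    return rows_with(matrix, *a) <= rows_with(matrix, *b)
```

With this stand-in, always_together(TOY, ("B", 2), ("D", 4)) holds, while C@3 occurs in all six rows and so carries no information about any other variable.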

Given a matrix of data, we further assume that there is some "true" underlying 
probability distribution $q(\,)$ which, for all orders $k = 1, 2, \ldots, N_A$, specifies the probabilities 
for each possible k-tuple of attributes. For example, for k = 1, we have $q(c_j) : A_j \to [0, 1]$, 
and we might have for some dataset q(B@2) = 0.33. A distribution also specifies higher- 
order probabilities, like, for example, q(B@2, F@6) = 0.166. Inherent in the particular 
problems posed is the problem of estimating or approximating the distribution $q(\,)$, or at 
least parts of it. 

The problem is to find some, or all, k-tuples of columns $(c_{j_1}, c_{j_2}, \ldots, c_{j_k})$, for $k = 2, \ldots, N_A$, 
whose correlation is greater than some predetermined value. For example, one may 
want a procedure which, given an M-by-N table of values, returns a list of k-tuples of 
column indices $(j_1, j_2, \ldots, j_k)$ such that 

$$D\Big(q(v_{j_1}, v_{j_2}, \ldots, v_{j_k}) \,\Big\|\, \prod_{m=1}^{k} q(v_{j_m})\Big) > \rho_k$$

for some real number $\rho_k$. Here $D(p_1 \| p_2)$ is the Kullback divergence measure, which in this case estimates 
the difference between the observed distribution of values over the column variables versus 
the distribution wherein all the column variables are statistically independent. The Kullback 
measure is just one of many possible measures of correlation or association applicable to this 
type of problem. 
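
As an illustrative sketch, the Kullback divergence between the empirical joint distribution of a set of columns and the product of their empirical marginals can be computed as below. Using raw relative frequencies as the estimate of the underlying distribution is an assumption of this example, as is the choice of base-2 logarithms.

```python
from collections import Counter
from math import log2

def kullback_divergence(matrix, columns):
    """D(joint || product of marginals), in bits, over 0-based columns."""
    m = len(matrix)
    joint = Counter(tuple(row[j] for j in columns) for row in matrix)
    marginals = [Counter(row[j] for row in matrix) for j in columns]
    total = 0.0
    for values, count in joint.items():
        p_joint = count / m
        p_indep = 1.0
        for marginal, v in zip(marginals, values):
            p_indep *= marginal[v] / m
        total += p_joint * log2(p_joint / p_indep)
    return total
```

Two perfectly dependent binary columns with balanced values give a divergence of 1 bit, and two columns whose values are empirically independent give 0.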

For our purposes we consider correlation in terms of deviation from statistical 
independence. One can compare an observed number of occurrences of some event in 
viewing the database versus the number expected if an underlying hypothesis of independent 
variables were true. That is, the problem is: given the table of values, for all $k = 2, \ldots, N_A$, 
return a list of all k-tuples of attributes $(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k})$ such that 

$$P\big(\mathrm{Observed}(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k}) \mid \mathrm{Independent}(c_{i_1}, c_{i_2}, \ldots, c_{i_k}), \mathrm{Model}\big) < \theta$$

for some observed behaviour of $(a_{i_1}@c_{i_1}, a_{i_2}@c_{i_2}, \ldots, a_{i_k}@c_{i_k})$, for some real number 
threshold $\theta \in [0, 1]$, and some Model which underlies one's estimation or hypothesis testing 
method. 

The sampling subprocess may be random sampling, and if random it may be subject 
to any of a number of possible probability distributions over the objects, including a uniform 
distribution. Similarly, there may be constraints on the statistical independence or 
dependencies between each of the T samples drawn during the operation of the method, and 
between each of the r objects drawn within one sample.

Sample Advantages of Preferred Embodiments 

There is at least one class of problems, arising in many diverse application areas, on 
which the comparative advantages of the coincidence detection method and apparatus 
described above and further to be described below are most apparent. Such problems are 
characterized by:

1. a large number of attributes (columns, in our representation);

2. the possible existence of some number of cliques of highly mutually
correlated attributes in the dataset, each member attribute of each such clique being
relatively uncorrelated with attributes outside its own clique; and

3. lack of prior knowledge as to the precise number, width (k, as in k-ary
correlation and kth-order feature), and location of such attribute cliques.

All other procedures of which we are aware either place prior limitations on the 
width k of discoverable k-tuples, or implement an exhaustive search, serial or parallel, over 
all or nearly all possible k-tuples of attributes. To put it more simply, the method of the 
preferred embodiment takes approximately the same computation time and memory to find a
44-ary correlation as it takes to find a 2-ary correlation in the same very high dimensional
dataset. Most prior methods, in contrast, either rule out the discovery of the 44th-order
feature or else require the allocation of orders of magnitude more time or space in order to 
find it. 

Sample Applications of Preferred Embodiments 

Modellers of very large data sets are thwarted in their attempts to compute very far
into a fully higher-order probabilistic model by both the computational complexity of the
task and by the lack of data needed to support statistically significant estimates of most of 
the higher-order terms. 

The preferred embodiment computes only a subset of higher-order probabilities, and
extracts a limited selection of higher-order features ("HOFs") for construction of a database
model. Efficient use can be made of limited computing resources by pre-selecting sets of
higher-order features using the correlation-detection methods described herein, and building
the most significant (statistically and in terms of application-specific criteria) into
model-based classifiers and predictors based on existing statistical, rule-based, neural network, or
grammar-based methods. The pre-selected sets of HOFs can be used to create rules for such
systems. For example, a data set may be analysed using the methods set out herein to
determine that if a company is filing a patent application then it should file an assignment
from the inventor. This rule is then used in the system to generate assignments whenever it
is determined that a company is filing a patent application. Many rule-based networks could
benefit from pre-processing using the methods described herein; see, for example, the System
and Method for Building a Computer-Based Rete Pattern Matching Network of Grady et al.
described in U.S. Patent Number 5,159,662 issued October 27, 1992; the inference engine
of Highland et al. described in U.S. Patent Number 5,119,470 issued June 2, 1992; and the
Fast Method for a Bidirectional Inference of Masui et al. described in U.S. Patent Number
5,179,632 issued January 12, 1993.

The discovered HOFs can alternatively be used directly to create products, for 
example, in the prediction or determination of protein structure, when fed into existing 
methods based on distance geometry or empirically-estimated patterns of cooperativity and
folding, or in marketing schemes based on correlated product sales information. 

Later below, practice of the principles described herein using the Los Alamos HIV
Database is described. In particular, the principles were applied to study of the V3 loop of
envelope proteins of human immunodeficiency virus (HIV). In biochemistry and molecular
biology in general, covariation of particular residues of a protein likely indicates the 
existence of a structural motif characterizing a region of the protein that has a functional, 
physiological role. 

Envelope proteins are partially embedded in the lipid membrane surrounding a virus 
particle, and project externally from the lipid. When the lipid of an HIV particle fuses with
the membrane of a host cell during infection, envelope proteins may also protrude from the 
membrane of the infected cell. The V in V3 stands for "variable", as the sequence of the V3 
loop is highly variable between different virus isolates. 

Previously, a Los Alamos group in B.T.M. Korber, R.M. Farber, D.H. Wolpert and
A.S. Lapedes, "Covariations in the V3 loop of HIV-1: An information-theoretic analysis",
Proc. Nat. Acad. Sci. U.S.A. 90 (1993), the disclosure of which is hereby incorporated
herein by reference, described 2-ary covariation mutations in certain residues of the V3 loop
of HIV-1 envelope proteins. Practice of the present principles has confirmed some of the
Los Alamos group's results, but has further permitted the discovery of other highly
covarying groups of residues. Whereas the Los Alamos group could only discover pairwise
covariation, we describe herein k-ary residue covariation, where k > 2. That is, we have
identified previously unrecognized motifs of HIV envelope protein.

For a particular trial, input consisted of the respective amino acid sequences of V3 
regions from 657 different virus isolates, and is shown in Appendix "B". Source code used 
on the input is shown in Appendices "A" and "D", named "File coinc.pl" and "File
probsort.pl", respectively. Output is shown in Appendix "C". 
Referring to Tables C.1 through C.9 set out elsewhere below, the results of 6
separate trials are shown. Parameter values are as indicated in the respective legends. In each
Table, the results are ordered by statistical significance, with the most significant correlation
first, and the standard one-letter amino acid code is employed. Thus, referring to Table C.6,
the most significant coincidence observed is the occurrence of alanine (A) at residue 18,
glutamine (Q) at residue 31, and histidine (H) at residue 33. This, like other coincidences set
forth on the cited pages, represents the identification of a structural motif of the HIV-1 V3
loop which comprises these residues.

Continuing with the particular example of A18/Q31/H33, the V3 structural motif
comprising these residues presumably exists on the exterior of the virus particle, and that
region of the V3 loop likely performs a specific function which requires the particular
structural motif. Thus, the structural motif would have to be conserved after mutation(s) to
preserve that function. This reasoning is extended to other coincidences identified herein.

The identification of a particular conserved structural motif of HIV has several uses. 

Using techniques known in the art, a peptide embodying the motif could be produced
for use as an antigen. Accordingly, a vaccine could be prepared. The peptide embodying the
motif might be made using known recombinant methods, as are described generally, for
example, in Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor
Laboratory, Cold Spring Harbor, NY (1982) and in Sambrook et al., Molecular Cloning: A
Laboratory Manual (2nd Edition), Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
(1989). Alternatively, the peptide or a peptidomimetic might be chemically synthesized
using standard chemical techniques. Monoclonal antibodies to the peptide or
peptidomimetic could be generated using standard methods, as described, for example, in
Harlow, E. and Lane, D., Antibodies: A Laboratory Manual, Cold Spring Harbor
Laboratory, Cold Spring Harbor, NY (1988). Fragments of such monoclonal antibodies, for
example, Fab fragments, that have specific affinity for the novel structural motif could also be
generated.
In another embodiment, a ligand that interacts with a structural motif identified
according to the invention could be generated. That is, the ligand would be characterized by
having chemical moieties of suitable identity and spatially located relative to each other so
that the moieties interact with corresponding residues or portions of the motif. In some
embodiments, the ligand could be an agent (e.g., a drug) that, by binding to the motif,
interferes with function of the region. The ligand would therefore be an HIV antagonist
with potential therapeutic utility. Alternatively, the ligand could bind to the particular V3
region comprising the identified motif, providing diagnostic utility. Such diagnostic utility
can be ex vivo. A ligand with diagnostic utility (e.g., an antibody) might comprise a label,
such as a fluor or an enzyme conjugate for use in a colorimetric reaction.
Fluorescence-labelled viruses or virus-infected cells could be visualized or counted using fluorescence
microscopy or FACS (fluorescence-activated cell sorting).

Methods of designing and identifying ligands that bind to structural motifs identified 
according to the invention are also provided by the invention. 

Thus, in one embodiment, the invention provides a ligand for binding with an
envelope protein of human immunodeficiency virus (HIV), wherein the envelope protein
includes a structural motif comprising amino acid residues A18/Q31/H33. The ligand
includes at least one functional group capable of binding to the motif. In a preferred
embodiment, the ligand includes at least one functional group capable of binding to and
being present in an effective position in said ligand to bind to residue 18, at least one
functional group capable of binding to and being present in an effective position in said
ligand to bind to residue 31, and at least one functional group capable of binding to and
being present in an effective position in said ligand to bind to residue 33.

In another embodiment, the invention provides a method of designing a ligand to
bind with a structural motif of an envelope protein of human immunodeficiency virus (HIV).
The method includes providing a template having spatial coordinates of A18, Q31 and H33
in the V3 loop of HIV-1 envelope protein, and computationally evolving a chemical ligand
using an effective algorithm with spatial constraints, so that said evolved ligand includes at
least one effective functional group that binds to the motif. In a preferred embodiment, the
ligand includes at least one functional group capable of binding to and being present in an
effective position in said ligand to bind to residue 18, at least one functional group capable
of binding to and being present in an effective position in said ligand to bind to residue 31,
and at least one functional group capable of binding to and being present in an effective
position in said ligand to bind to residue 33.

In another embodiment, the invention provides a method of identifying a ligand to
bind with a structural motif of an envelope protein of human immunodeficiency virus (HIV).
The method includes: providing a template having spatial coordinates of A18, Q31 and H33
in the V3 loop of HIV-1 envelope protein; providing a data base containing structure and
orientation of molecules; and screening said molecules to determine if they contain effective
moieties spaced relative to each other so that the moieties interact with the motif. In a
preferred embodiment, a first moiety of the molecule interacts with residue 18, a second
moiety of the molecule interacts with residue 31 and a third moiety of the molecule interacts
with residue 33.

The principles described herein encompass similar respective embodiments, including
antigens and vaccines, for the other covarying k-tuples described herein, that is, both
residues of the V3 loop that covary, and particular amino acids at certain residues that 
covary. 

The method of the current invention can be viewed as a "high-pass filter" for
detection of higher-order features. Such HOFs play an important role in database modelling,
machine learning, and perception and pattern-recognition. In database mining and modelling
contexts, a procedure for discovery of these features might serve any of several major roles,
including:

1. Preprocessing of large, complex datasets: Many of the best modelling
methods, including Gibbs models, Hidden Markov Models and EM, MacKay's
density networks, and related factorial learning methods from the neural network
community, could be helped significantly in capturing higher-order interactions
without exhaustive search or combinatorial explosion of parameter space if preceded
by a fast preprocessing procedure, such as one provided by implementing the
principles described herein, that found plausibly correlated variables in the database.

2. Visual exploration of large complex data sets: If coupled to even a simple
graphical display interface, a procedure such as ours permits a user to view quickly
(with a small number of r-samples) the most plausibly interesting higher-order features
in high-dimensional data.

3. Pre-conditioning and redundancy elimination: Thus far, we have stressed
the utility of finding inter-attribute correlations in order to use them in the building of
models; but in many optimization, learning and data-fitting applications, one requires
that correlations between variables be found and eliminated, through any of a
number of subspace methods like principal components analysis (PCA).

An Embodiment Using a Programmable Digital Computer 

Components for Digital Computer Embodiment 

Data Matrix, Sampling, and Coincidences. Given a set of M objects, each of
which has either a "Yes" (representable by 1) or "No" (representable by 0) value for each of
a fixed set of $N_A$ attributes, the input dataset can be arranged into an M-by-$N_A$ table of
values, which we shall call the data matrix or simply matrix. This matrix, as well as its
sub-matrices and related vectors that comprise functional parts of the system/process
described below, is stored in memory locations within a programmable computer. In this
representation the rows of the matrix correspond to objects, and the columns correspond to
attributes. The matrix may be labelled as $V$, and each element of this two-dimensional table
labelled by $v_{ij} \in \{0, 1\}$, where $i$ refers to the ith object (row) $o_i$ and $j$ refers to the jth
attribute (column) $a_j$. The set of objects may be listed, for the purposes of this description,
as $O = o_1, o_2, \ldots, o_M$ and the set of attributes may be listed as $A = a_1, a_2, \ldots, a_{N_A}$.



Figure 5A illustrates these terms as applied to the example illustrated in Figure 5 
discussed in more detail below with regard to the program method description of a preferred 
embodiment. 

A particular attribute $a_j$ may be said to occur in a particular object (row) $i$ if $v_{ij} = 1$.

Given an ordered list of $1 \le m \le M$ objects (rows) 5, an incidence vector 2 for an
attribute $a_j$ may be defined as the binary vector or string of length m such that the kth bit is 1
if and only if the attribute $a_j$ occurs in the kth object in the given list of objects. The
incidence vector 2 is a simple representation of the pattern of occurrence of the attribute
over some set of objects, for example, the set of all M objects or the set of objects
corresponding to one r-sample as described below.
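The definition of an incidence vector can be illustrated with a short Python sketch. The matrix below is hypothetical toy data, and representing the vector as a string of '1' and '0' characters is one convenient choice among several, not something fixed by the description.

```python
def incidence_vector(matrix, rows, attribute):
    """Return the incidence vector of one attribute (a column index) over an
    ordered list of row indices, as a string of '1' and '0' characters."""
    return "".join("1" if matrix[i][attribute] else "0" for i in rows)

# Hypothetical 4-object, 4-attribute binary data matrix.
matrix = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]

over_sample = incidence_vector(matrix, [0, 1, 2], 0)    # "110"
over_all = incidence_vector(matrix, range(4), 0)        # "1101"
```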

An r-sample, for example the three rows identified by reference numeral 4 in
Figure 5A, is a set of r of the M records drawn randomly from some probability distribution.
In some preferred embodiments, the rows within a sample are considered to be drawn
independently from a uniform distribution.

The drawing of an r-sample 4 is performed by the system one time within
each of a specified number of iterations. In some preferred embodiments, the samples drawn
over the total number T of iterations are considered to be drawn independently from a
uniform distribution.

In some preferred embodiments, different values of r are used for different sequential
iterations of the sampling, and/or for different subsets of the dataset processed by different
processing nodes in a parallel computing embodiment. In such cases, we may say that on
the ith iteration or in the ith sample, the number of objects sampled is $r_i$. Some advantages
of using different sample sizes include: the ability to try, within one run-through of the
method, different values of r when one is unsure which values of r are best; and the ability
to pick different values of r for different processing nodes in a parallel computing
embodiment, in order to make optimal use of different processor sizes/speeds and memory
sizes among the different processing nodes. An advantage of using the same, single value of
r throughout a run-through of the method is the slight gain in simplicity of the program
code.

A coincident set, or cset, may be defined as a pattern comprising the joint
appearance of $1 \le k \le N_A$ attributes (columns) 1 within some set of objects (rows) 5. That
is, given some one or more rows 5 under consideration, there is a cset $\alpha = (a_{j_1}, a_{j_2}, \ldots, a_{j_k})$
if $a_{j_1}, a_{j_2}, \ldots,$ and $a_{j_k}$ all occur in the given row or rows. For example, elements
A@c1, B@c2, D@c4 identified by reference numeral 3 in Figure 5A are a coincidence
set (cset).

Within the computer memory is stored a data structure termed the cset table, which
is a means for storing the identity and occurrence count for each cset that occurs in one or
more iterations within the process. The identity of a cset is a list of attributes (columns)
comprising the cset; the occurrence count is a number corresponding to the number of
occurrences of a cset that have been observed up to a particular iteration within the process,
or at the end of all the iterations. In some preferred embodiments, the cset table is
implemented as a hash table stored in a computer memory.

A cset has, for a given r-sample, a particular incidence vector, which is its
binary-encoded record of occurrences (denoted by '1') and non-occurrences ('0') over the r
data items in the sample. Therefore a cset, corresponding to a set of k attributes, may have
an associated incidence vector; and an individual attribute may have an associated incidence
vector.

A match (or coincidence) of size h is said to occur, in a given r-sample, for a given
cset $\alpha = (a_{i_1}, \ldots, a_{i_k})$, when $a_{i_1}$ appears in h out of the r records, ..., and $a_{i_k}$ appears in h out
of the r records, and they all appear in exactly the same h out of r records (see Figure 5A).

Observed Counts of Coincidences. The coincidences are observed, and the
corresponding csets stored or updated, by means of a binning method. In each iteration, the
attributes are binned, that is, placed into separate subsets according to their incidence
vectors 2 over the r-sample 4 for the current iteration. In this described matrix-based
embodiment of the invention, these vectors act like r-bit addresses into a very sparse subset
of $2^r$ address space. (See Figures 5 and 5A.)

All the attributes in one bin constitute a cset. The cset is recorded: if the particular
cset has occurred in a previous iteration, then its count of occurrences is updated; if it has
not occurred previously, then an entry in the cset table is created for it, and then its
occurrence count is updated. In this described embodiment, the system stores the number h
($0 \le h \le r$) of occurrences for this and each iteration. After a specified number T of
iterations has been completed, the cset table contains a list of all the csets observed, and, for
each cset $\alpha$, a total number of observed coincidences, which corresponds to $\sum_{i=1}^{T} h_i(\alpha)$,
where $h_i(\alpha)$ is the number of joint occurrences for the k attributes comprising $\alpha$, for the ith
iteration.
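The binning step for a single iteration might be sketched in Python as follows. The function names and the toy matrix are illustrative assumptions. The sketch also assumes that attributes absent from every sampled row (an all-zero incidence vector) are skipped, since a match of size 0 would add nothing to the count; the description does not state this explicitly.

```python
from collections import Counter, defaultdict

def bin_attributes(matrix, sample_rows):
    """Group attribute (column) indices by their incidence vector over the sample."""
    bins = defaultdict(list)
    for j in range(len(matrix[0])):
        vector = tuple(matrix[i][j] for i in sample_rows)
        if any(vector):  # assumption: ignore attributes absent from the whole sample
            bins[vector].append(j)
    return bins

def record_coincidences(bins, cset_table):
    """Add h (the number of 1s in the shared vector) to the count of each cset."""
    for vector, attrs in bins.items():
        if len(attrs) > 1:  # bins holding a single attribute are ignored
            cset_table[tuple(attrs)] += sum(vector)

# Hypothetical 4-object, 4-attribute binary data matrix.
matrix = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]
cset_table = Counter()  # plays the role of the hash-table cset table
record_coincidences(bin_attributes(matrix, [0, 1, 2]), cset_table)
record_coincidences(bin_attributes(matrix, [1, 2, 3]), cset_table)
# cset_table is now Counter({(0, 1, 3): 2, (1, 3): 1})
```

In the first sample, columns 0, 1 and 3 share the incidence vector 110 and so collide in one bin with h = 2. Because the dictionary is keyed by the vector itself, only addresses that actually occur are ever created, which is how the sparse subset of the $2^r$ address space is handled here.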

Expected Count Function. An expected count function is a mathematical function,
implemented as a computer program or subroutine, or in electronic or optical circuits, which
takes a set of attributes $a_{i_1}, a_{i_2}, \ldots, a_{i_k}$ and a number T and produces a number
corresponding to an expected number of coincidences for that set of attributes in a process
of T iterations of drawing of r-samples and observing coincidences.

In one particular embodiment of the invention, the function $f_{match}(\alpha, h, r)$ is obtained
from the multinomial distribution:

$$f_{match}(\alpha, h, r) = \frac{r!}{h!\,(r-h)!}\; p(a_{i_1}, \ldots, a_{i_k})^{h}\; p(\bar{a}_{i_1}, \ldots, \bar{a}_{i_k})^{r-h}$$

where $\bar{a}$ denotes the absence of attribute $a$. This formula gives an estimate of the probability for finding exactly h occurrences of
$a_{i_1}$, h occurrences of $a_{i_2}$, ..., and h occurrences of $a_{i_k}$, all occurring in the same h rows, in
one r-sample.

(This function definition has a simple form because all but two of the large number of
p() factors in the standard multinomial expression vanish with zero exponents.)



The probability of a match of size h for the k attributes which make up a potential
cset has been defined in terms of the joint probability $p(a_{i_1}, \ldots, a_{i_k})$; the Expected Count
Function must employ particular estimates for these joint probabilities. In this preferred
embodiment, the joint probability estimates incorporate the hypothesis of independence
between the individual attributes. Therefore in the definition formula given above we
substitute $\prod_{l=1}^{k} p(a_{i_l})$ for $p(a_{i_1}, \ldots, a_{i_k})$ and $\prod_{l=1}^{k} (1 - p(a_{i_l}))$ for $p(\bar{a}_{i_1}, \ldots, \bar{a}_{i_k})$.
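A minimal Python sketch of the match probability under the independence hypothesis is given below. The function `f_match` follows the formula and substitution described above. The helper `expected_match_count` is an assumption: the text does not spell out how per-sample match probabilities are aggregated over T iterations, so the sketch simply takes T times the expected match size in one r-sample.

```python
from math import comb, prod

def f_match(marginals, h, r):
    """Probability of a match of size h in one r-sample, assuming independent
    attributes; marginals holds the first-order probability p(a) of each one."""
    p_all = prod(marginals)                       # every attribute present in a row
    p_none = prod(1.0 - p for p in marginals)     # every attribute absent from a row
    return comb(r, h) * p_all**h * p_none**(r - h)

def expected_match_count(marginals, r, T):
    """Assumed aggregation: T iterations times the expected match size per sample."""
    return T * sum(h * f_match(marginals, h, r) for h in range(1, r + 1))

p_pair = f_match([0.5, 0.5], h=2, r=3)               # 3 * 0.25**2 * 0.25 = 0.046875
n_expected = expected_match_count([0.5], r=3, T=10)  # 15.0 for a single attribute
```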

Hypothesis Test Function and Correlation Measure. A hypothesis test is a
mathematical procedure, implemented as a computer program or subroutine, or in special
purpose electronic and/or optical hardware, which takes a pair of numbers
representing the expected and observed numbers of coincidences, respectively, for a
particular set of k attributes, and produces a number C representing an estimate of the
correlation among the k attributes.

In some preferred embodiments, a Chernoff bound on tail probabilities provides the 
hypothesis test function, as described below. 



Let random variable $X_i$ hold the value $h_i$ for each iteration $i$, and let $X = \sum_{i=1}^{T} X_i$, and
note that $0 \le X \le T \cdot r$. The method of Chernoff-Hoeffding bounds [8] provides the
following theorem:

Let $X = X_1 + X_2 + \cdots + X_n$ be the sum of n independent random variables, where
$l_i \le X_i \le u_i$ for reals $l_i$ ("lower") and $u_i$ ("upper").

Then

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{n} (u_i - l_i)^2}\right) \qquad (1)$$

For our purposes, we set $n = T$ and $l_i = 0$ and $u_i = r_i$ for all $i = 1, 2, \ldots, T$, and we
thereby obtain

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{T} r_i^2}\right) \qquad (2)$$

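Using this mathematical relationship, an effective procedure for computing a
correlation value can be defined, with $\delta$ taken as the excess of the observed number of
coincidences over the expected number:

$$\mathrm{Corr}(\alpha) = 1 - \exp\left(\frac{-2\delta^2}{\sum_{i=1}^{T} r_i^2}\right).$$

In the special case wherein the same sample size r is used for every iteration of the
sampling, that is, when $r_i = r$ for all $i = 1, 2, \ldots, T$, then the above formulas reduce to the
simpler forms:

$$P[X - E[X] \ge \delta] \le \exp\left(\frac{-2\delta^2}{T r^2}\right)$$

$$\mathrm{Corr}(\alpha) = 1 - \exp\left(\frac{-2\delta^2}{T r^2}\right).$$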



Here the correlation value corresponds to an estimate of 1 minus the probability of
having observed the observed number of coincidences, over T iterations of r-sampling, if the hypotheses
underlying the expected count were true. If the assumption of independence between
the attributes was used to compute the expected count, as described above for some preferred embodiments,
then this hypothesis test provides a correlation value for each cset that estimates the
deviation from independence; that is, it estimates the statistical dependence between the
attributes making up the cset.
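The correlation measure derived from the Chernoff-Hoeffding bound can be sketched as a small Python function. The function name and the list-of-sample-sizes argument are illustrative assumptions. Returning 0 when the observed count does not exceed the expected count is also an assumption of this sketch, made because the bound used above is one-sided.

```python
import math

def correlation(observed, expected, sample_sizes):
    """Corr = 1 - exp(-2 * delta**2 / sum(r_i**2)), where delta is the excess of
    the observed over the expected coincidence count and sample_sizes lists
    the sample size r_i used in each of the T iterations."""
    delta = observed - expected
    if delta <= 0:
        return 0.0  # assumption: no excess over expectation means no evidence
    return 1.0 - math.exp(-2.0 * delta**2 / sum(r * r for r in sample_sizes))

# T = 100 iterations with r = 3: delta = 15, exponent = -2 * 225 / 900 = -0.5
corr = correlation(observed=20, expected=5, sample_sizes=[3] * 100)
```

With a constant sample size the denominator is $T r^2$, so the same function covers both the general and the special-case formulas.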

Operation of the Components Within a Process 

Typically, the representation component is performed first within the overall process
of the current invention. A plurality of sampling iterations is performed on the
representation of the data, and for each r-sample, the detection and recording of
coincidences is performed. The sampling iterations may be performed sequentially or in
parallel, or in some combination of sequential and parallel steps.

At any stage within the process, the determining of an expected count of
coincidences, for some or all of the coincident sets of attributes, is performed. This
component of the process may be performed all at once for all coincident sets, or
incrementally; sequentially or in parallel, or in some combination. It may be performed for
coincident sets (csets) as each coincidence is detected or stored, or may be performed before
or after such detection or recording.

After some number of sampling iterations has been performed, the comparing of
actual to expected number of coincidences may be performed for some or all recorded
coincident sets. This may be done for all csets at once, or for any subsets of them at
different points throughout the process. These comparisons for different csets may be
performed sequentially or in parallel, or in some combination thereof.

After some number of sampling iterations has been performed, the reporting of sets
of correlated attributes may be performed for some or all of the recorded coincident sets that
have been determined, in the comparisons, to signal significant correlations between the
component attributes. This may be done for all csets at once, or for any subsets of them at
different points throughout the process. These comparisons for different csets may be
performed sequentially or in parallel, or in some combination thereof.

Program Method Description of a Preferred Embodiment 
Below is shown, in pseudocode, a program on appropriate media, for example, a
floppy disk, hard drive, RAM or other such media, corresponding to one possible 
embodiment on a programmable digital computer. 

Figure 5 provides a pictorial example of the application of this embodiment to a
fictional toy dataset. Three iterations of r-sampling (for r = 3) on the toy dataset are
depicted, top to bottom. For each iteration, the left-hand box represents the dataset, with
outlined entries representing the sampled rows. The right-hand box represents the set of
bins into which the attributes collide. For example, in the first iteration, A@1, B@2, and
D@4 all occur in the first and second of the three sampled rows, so they each have incidence
vector 110 and collide in the bin labelled by that binary address. Bins containing only a
single attribute are ignored; and "empty" bins are never created at all. All bins are cleared
and removed after each iteration, but collisions are recorded in the Csets global data
structure.

Procedure to find correlated sets of attributes:

 0. begin
 1.   read (MATRIX);
 2.   read (R, T);
 3.   compute_first_order_marginals(MATRIX);
 4.   csets := {};
 5.   for iter = 1 to T do
 6.     sampled_rows := rsample(R, MATRIX);
 7.     attributes := get_attributes(sampled_rows);
 8.     all_coincidences := find_all_coincidences(attributes);
 9.     for coincidence in all_coincidences do
10.       if cset_already_exists(coincidence, csets)
11.         then update_cset(coincidence, csets);
12.         else add_new_cset(coincidence, csets);
13.       endif
14.     endfor
15.   endfor
16.   for cset in csets do
17.     expected := compute_expected_match_count(cset);
18.     observed := get_observed_match_count(cset);
19.     stats := update_stats(cset, hypoth_test(expected, observed));
20.   endfor
21.   print_final_stats(csets, stats);
22. end
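For readers who prefer running code, the pseudocode can be approximated by the following Python sketch. It is not the Perl listing of the appendices. The function name, the seeded random generator, sampling rows uniformly with replacement, skipping all-zero incidence vectors, and taking the expected count as T times the expected match size per sample are all assumptions of this sketch.

```python
import math
import random
from collections import Counter, defaultdict

def find_correlated_sets(matrix, r, T, threshold, seed=0):
    rng = random.Random(seed)
    m, n_attrs = len(matrix), len(matrix[0])
    # Step 3: first-order marginal probability of each attribute.
    marginals = [sum(row[j] for row in matrix) / m for j in range(n_attrs)]
    csets = Counter()                                    # step 4
    for _ in range(T):                                   # step 5
        sample = [rng.randrange(m) for _ in range(r)]    # step 6 (with replacement)
        bins = defaultdict(list)                         # steps 7-8: bin by vector
        for j in range(n_attrs):
            vector = tuple(matrix[i][j] for i in sample)
            if any(vector):
                bins[vector].append(j)
        for vector, attrs in bins.items():               # steps 9-14
            if len(attrs) > 1:
                csets[tuple(attrs)] += sum(vector)
    report = []
    for cset, observed in csets.items():                 # steps 16-20
        p_all = math.prod(marginals[j] for j in cset)
        p_none = math.prod(1 - marginals[j] for j in cset)
        expected = T * sum(
            h * math.comb(r, h) * p_all**h * p_none**(r - h)
            for h in range(1, r + 1)
        )
        delta = observed - expected
        corr = 1 - math.exp(-2 * delta**2 / (T * r * r)) if delta > 0 else 0.0
        if corr > threshold:
            report.append((cset, observed, expected, corr))
    return sorted(report, key=lambda item: -item[3])     # step 21

# Hypothetical data: columns 0 and 1 are identical, column 2 is unrelated.
matrix = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
results = find_correlated_sets(matrix, r=3, T=200, threshold=0.99)
```

On this toy matrix the pair of identical columns collides in almost every iteration, so its observed count far exceeds the count expected under independence and it is reported.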



Steps 5 through 21 of the pseudocode represent the steps of the base method described
herein, namely:

• sampling a subset of the matrix for a predetermined number of iterations, each subset
of attributes being the same,

• detecting and recording counts of coincidences of attributes in each sampled subset,
a coincidence being the occurrence of a plurality of attributes in an object in a
sampled subset, where the plurality of attributes is the same for each occurrence,

• determining an expected count for each coincidence of interest, the determining
being performed before, at the same time, or after sampling, detecting and recording,

• comparing, for each coincidence of interest, the observed count of coincidences
versus the expected count of coincidences, and from this comparison determining a
measure of correlation for the plurality of attributes for the coincidence, and

• reporting a set of k-tuples of correlated attributes, where a k-tuple of correlated
attributes is a plurality of attributes for which the measure of correlation is above a
pre-determined threshold.



Appendix "B" contains actual source code written in the Perl language for running on a
Sun4 computer in the Sun UNIX operating system. Sample input data for the code listing in
Appendix "B" is listed in Appendix "C" for partial amino acid sequences from V3 loop of
HIV envelope proteins. The corresponding output from the code of Appendix "B" for the
input of Appendix "C" is shown in Appendix "D". In order to produce the output of
Appendix "D", the adjunct Perl language program listed in Appendix "E" was used for
clarification and presentation from the main code listing in Appendix "B". A general flow
diagram for this embodiment is shown in Figure 6, while a general block diagram is shown in
Figure 7. The resulting report was stored in a flat file as a relatively unstructured ASCII
database, which was later printed; it could equally well have been sent to a printer directly or
sent across a network for report to other resources.

Alternative Embodiments 

Descriptions of alternative embodiments of the present invention may be divided into
two categories, described separately below: first, different physical embodiments of the
system/process as may be used in many potential problem-specific applications; and, second,
different interpretations of the components enumerated in the description above, according
to different problem-specific applications of the present invention.

Different Implementations 

For example, among the many possible embodiments as programs on programmable 
digital computers: 

The method may be run entirely sequentially, as in the most straightforward
interpretation of the pseudocode given above, or the method may be run on parallel (vector
or multiprocessor) or distributed computer systems in many possible ways. A set of
computations may be run in parallel, in which each computation performs the entire program
steps outlined above, but with each separate computation using a different value for r, the
sample size; or each separate computation could run the same program steps with the same key
parameter values, but start with different initial random number seeds for the random
r-sampling. Alternatively, the entire program steps outlined above could be run once, but each
different r-sample could be forked off into a separate process run on a different processor,
wherein each such process would comprise the detection and optionally recording steps,
with the global cset counts later joined into the global process and global data structures.
Additionally, the computation of the expected counts, and the comparisons of expected with
observed counts, could be performed all at once or incrementally, sequentially or in parallel.
Similarly, the reporting of the estimated correlation values can be performed for some or all
of the Csets, once at the end of computation or incrementally throughout, in serial or
parallel.
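One of the parallel arrangements described above (the same program steps run with different random number seeds, with the cset counts joined afterwards) might be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code of Appendix "B": a coincidence is taken here to be a pair of attributes shared by at least two objects of an r-sample, and `count_coincidences` and `run_parallel` are hypothetical names. A thread pool is used only to keep the sketch self-contained; separate processes or machines would be needed for real speed-up.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_coincidences(data, r, iterations, seed):
    """Draw `iterations` r-samples and tally attribute pairs that co-occur
    in at least two objects of a sample (pairs only, to keep this short)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        sample = rng.sample(data, r)  # r objects, without replacement
        seen = Counter()
        for obj in sample:
            for pair in combinations(sorted(obj), 2):
                seen[pair] += 1
        for pair, hits in seen.items():
            if hits >= 2:
                counts[pair] += 1
    return counts


def run_parallel(data, r, iterations, seeds):
    """Run one independent computation per seed and join the cset counts."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda seed: count_coincidences(data, r, iterations, seed), seeds)
        for partial in partials:
            total.update(partial)  # Counter.update adds counts together
    return total


# Three identical objects: every 2-sample yields the coincidence ("a", "b").
data = [{"a", "b"}, {"a", "b"}, {"a", "b"}]
total = run_parallel(data, r=2, iterations=5, seeds=[1, 2, 3])
```

Because each computation owns its random number generator, the partial counts are independent of scheduling order and can be joined in any sequence.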

The output of the method, which can include the reporting of the significantly 
correlated k-tuples of attributes (the csets that are deemed sufficiently highly correlated in 
the comparing, a.k.a., hypothesis testing stage), can be verbal, and/or numerical and/or 
graphical. 

A number of sampling schemes are possible, including deterministic, pseudo-random,
or purely random. If pseudo-random or random, any of a number of random sampling
schemes may be used, including hypergeometric and multinomial sampling. The r objects
within an r-sample may be sampled "with replacement" or "without replacement". At the
next level up, the set of r-samples itself may be drawn "with replacement" or "without
replacement".
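The two ways of drawing the r objects of an r-sample might be sketched in Python as follows. The helper name `draw_r_sample` is hypothetical, and the standard library's pseudo-random generator stands in for whatever source of randomness an embodiment uses.

```python
import random


def draw_r_sample(objects, r, replace=False, rng=random):
    """Draw one r-sample from `objects`."""
    if replace:
        return rng.choices(objects, k=r)  # "with replacement"
    return rng.sample(objects, r)         # "without replacement"


rng = random.Random(0)          # fixed seed so the run is repeatable
objects = list(range(10))
without = draw_r_sample(objects, 4, replace=False, rng=rng)
with_repl = draw_r_sample(objects, 15, replace=True, rng=rng)
```

Without replacement, r cannot exceed the number of objects and no object repeats within a sample; with replacement, r is unconstrained and repeats are possible, which changes the expected coincidence counts accordingly.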

Different choices for the key sampling parameter r are possible, and it is not 
necessary to use the same number r for each sample. 

Many possible choices exist for T, the number of sampling iterations. It is possible to 
use any of a number of mathematical methods for choosing T in order to achieve a desired 
confidence level in the degrees of correlation estimated for the k-tuples of attributes 
discovered by the method of the current invention. Alternatively, it is possible to run the 
procedure for a given fixed number of iterations and then print or view the results, or to 
interleave the running of some number of iterations with the printing or viewing of partial 
results. 

Many possible ways exist for the representation, storage, and accessing of the Csets 
data structure used during the processing of the algorithm. The Csets data may be stored 
and accessed via a hash table, a k-d tree, patricia tree (also called a trie), and/or in other 
ways, known to those skilled in the art, of storing and accessing data efficiently. Whatever 
data structure is chosen, the structure may be stored physically in registers, in main memory,
and/or on secondary or external storage media such as magnetic disks, magnetic tape, or
optical storage media.
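As one illustration of the hash-table option mentioned above, a Csets store might be sketched in Python as follows. The class name `CsetTable` and the choice of a frozenset of (position, value) pairs as the key are assumptions made for this sketch only.

```python
from collections import defaultdict


class CsetTable:
    """Hash table of csets; each cset is a frozenset of (position, value)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, attributes, hits=1):
        key = frozenset(attributes)  # order-independent, hashable key
        self._counts[key] += hits

    def count(self, attributes):
        # .get() avoids creating an entry for a cset never recorded.
        return self._counts.get(frozenset(attributes), 0)

    def items(self):
        return self._counts.items()


table = CsetTable()
table.record([(3, "G"), (5, "L")])
table.record([(5, "L"), (3, "G")])  # same cset, listed in a different order
```

Keying on an unordered set means the same coincidence is found regardless of the order in which its component attributes were detected.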

Alternative to the embodiments of the method on general-purpose computing
hardware of various types, there are many possible embodiments on special-purpose
electronic, optical, or electro-optical hardware, or some combination of general-purpose and
special-purpose architectures and devices.

For example, very efficient special-purpose electronic hardware (LSI or VLSI) may be used to
implement the matrix representation of the current invention. This is made possible by the
fact that the incidence vectors of attributes are simple binary vectors, by the fact that the
coincidence "bins", described earlier in one view of the current invention, correspond to
"addresses" to a memory space of size T for each r-sample, and by the ability with current
technology to design, fabricate and use special-purpose hardware for implementations of
random-number generation and sampling, fast-access storage of the Csets data structures, and
of the mathematical functions used in the calculation of expected count estimates and
hypothesis tests and correlation estimates.

Special Purpose Hardware Method Description of a Preferred Embodiment 
1. Overview 

Referring now to Figure 14, an embodiment of the special-purpose hardware mentioned
previously is intended to exploit the potential benefits of parallelizing the execution of the
algorithm. A node (defined below) divides a given data set along M (the number of rows of
data) and distributes these portions to its CPs (also defined below). The CPs may be either
other nodes (in a recursive definition) or special-purpose processors developed to
perform step 8 in the method as described in high-level "pseudo-code" in the previous
Program Method Description of a Preferred Embodiment section. When the results have
been computed by the node's CPs, the merging step (steps 9 through 14 in the above-noted
"pseudo-code" description) is performed by the node. Once the merging has been done, the
results are passed back to the node's parent. If the node is the root of the tree, the complete
results set is sent back to the driver that controls this hardware. The system described below
can be used "off-line" from a main computer's CPU; among other possibilities for
commercial marketing and use of such a system is its implementation on a special "board" or
"card" that a user can purchase and install on his or her personal computer or workstation.
One can also envision the use of one or a number of such special subsystems on a local area
network or a "supercomputer" installation. The described embodiment represents only one
of many possible ways, as will be understood by those skilled in the art, to parallelize the
methods described herein.

The implementation described below is assumed to act solely on character-valued
data attributes. This is in no way a limitation of the basic methods described herein; rather, it
is a specific implementation of the basic methods. The implementation could easily follow a
binary-attribute encoding as described elsewhere herein.

A diagram of a node is shown in Figure 14 with compute processors (CPs). The
node includes the following:

A bank of memory where input to be sent to the CPs is stored (the input buffer) and
where results found by the CPs will be stored (the output buffer).

A memory bus divided into control, data and address buses, used to arbitrate
communication on the bus itself as well as being the vehicle for data transfer.

A set of bit flags and a small additional portion of memory (LastOut). LastOut is the
address of the section in the output buffer that was last written to. The two bit flags
are used by the merge and I/O processors to determine what state they each are in.

An array of size J of compute processors (CPs), each with its own local memory
cache, which perform the discovery of coincidences.

A merge processor (MG) which has its own cache of memory in which it writes the
merged results of the CPs.

An input/output processor (IO) whose main responsibility is to control use of the
bus.

A clock which is used to ensure that each element in the system runs synchronously
with respect to every other element. Execution of each of the parts in the system can
be thought of as running in lock-step.

Compute processors are defined as being either special processors that perform the
R-sampling step of the algorithm (step 8 in the pseudo-code description, shown graphically in
Figure 5) or other nodes. This allows the possibility of a tree structure of such nodes rather
than limiting embodiments solely to a vector arrangement. For any particular choice of
hardware for the memory bus, it may be the case that there is a maximally useful limit on the
number of CPs per node. A tree structure allows a way around this limit.

The implementation assumes that maximal values of method parameters R and N
(Rmax and Nmax) are specified a priori. It is the responsibility of the software driver to
detect when these limits have been violated and react accordingly.

2. Bank of Memory 

Each node has memory of size 2*J*Amax*Rmax*Nmax, where Amax is the maximal
total number of iterations that can be done in the node. This memory is divided equally into
the input and output buffers. Note that the size of the input for a single iteration is no greater
than J*Rmax*Nmax and neither the locally-produced results nor the final merged results
(formed by combining the partial results from the J CPs) can exceed this limit, so there is no
risk of exceeding available memory.
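The sizing arithmetic above can be checked with a short Python sketch. The function name and the parameter values below are arbitrary illustrations, and sizes are counted in characters on the assumption of character-valued data.

```python
def node_memory_sizes(j, a_max, r_max, n_max):
    """Return (total, input_buffer, output_buffer, per_iteration_bound)."""
    total = 2 * j * a_max * r_max * n_max
    input_buffer = output_buffer = total // 2   # divided equally
    per_iteration_bound = j * r_max * n_max     # input for one iteration
    return total, input_buffer, output_buffer, per_iteration_bound


# Arbitrary example: 4 CPs, up to 10 iterations, Rmax = 8, Nmax = 32.
sizes = node_memory_sizes(j=4, a_max=10, r_max=8, n_max=32)
```

With these values the total is 20480 characters, each buffer holds 10240, and ten iterations of at most 1024 characters each exactly fill the input buffer.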

Access to this memory is as follows: 

IO has write access to the input buffer and read access to the output buffer.
MG has no access to the input buffer and read access to the output buffer.
CP has read access to the input buffer and write access to the output buffer.

3. Memory Bus

Control of the memory bus is the responsibility of the IO processor. Each CP is
assigned a numeric identifier (0 to J + 1, as IO is implicitly assigned zero and MG is assigned
1). The memory bus is divided into three sections:

Control: Two wires for each CP, two for MG and two for IO comprise the control
bus. The first of each pair is called the request wire while the second is known as the
response wire.

Address: Each device in the system is assigned a unique memory address range. The
address bus, used in combination with the data bus, determines which device the
current value on the data bus will be written to and, if applicable, where within that
device it will be stored. The width of the address bus (i.e. the number of wires in it)
is determined by the choice of size for the memory storage of input and output and
thus will not be specified here.

Data: Given the assumption that only character-valued data attributes will be handled
by this system, the data bus is eight wires wide.

Bus arbitration is handled through the use of the control bus. When a device (here
meaning MG, IO or one of the CPs) wishes to use the bus, it asserts a logical 1 on its
request wire. On any given cycle, more than one device may have done so. IO, when it
returns to its bus arbitration duties, simply sets the lowest numbered device's response wire
to 1 and zeroes all the other response wires. This tells the lowest identified device that it has
permission to use the bus (reads and writes are not indicated; IO is responsible for
establishing this context) and all others that they must wait. All devices that wish to use the
bus continue to assert 1 on their request wire until given permission. When the permitted
device has finished with the bus, the device asserts 0 on its request wire, indicating to IO
that it may reassign the bus to another device. "Handshake" and other types of protocols,
such as described above, are well-known to and understood by those skilled in the art.
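The arbitration rule described above (grant the bus to the lowest numbered requesting device and zero every other response wire) might be modelled in Python as follows. Representing the wires as a list indexed by device identifier, and the function name `arbitrate`, are assumptions of this sketch.

```python
def arbitrate(request_wires):
    """Given request wires indexed by device id, return the response wires.

    The lowest-numbered requesting device is granted the bus (response 1);
    every other response wire is zeroed.
    """
    response = [0] * len(request_wires)
    for device_id, requesting in enumerate(request_wires):
        if requesting:
            response[device_id] = 1
            break
    return response


# Devices 0 (IO) and 1 (MG) idle; devices 2 and 4 both request the bus.
grants = arbitrate([0, 0, 1, 0, 1])
```

Device 4 keeps its request wire at 1 and is granted the bus on a later call, once device 2 has dropped its request.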

4. Bit Flags and Additional Memory 

The additional memory is used by IO to store the last written output section. There is
no need to store a list of such sections for MG because writes to the output buffer are
done incrementally and MG can determine how many unused sections it has waiting by
comparing its last read index with the last written index. Only IO can write to this memory
and only MG may read from it.

Two bit flags are used to indicate "IO finished" (meaning IO has sent all data out
and received all CP output) and "Merge finished".

5. An Array of J Compute Processors

As noted above, these are either nodes or special-purpose processors that
compute one R-sampling step in the algorithmic description of the general method of the
current invention. In the latter case, they may comprise:

a processor which performs the coincidence detection in addition to the functions
    listed below
2*Nmax*Rmax sized local memory

The memory is split into two equal portions for input and output.

Initially, a CP asserts 1 on its request wire, indicating that it is ready for data. When
it sees only its response wire set to one on the following cycle, it expects to be sent the
current values for R and N and then the data itself (otherwise, it waits for this to be the
case). Based on the first two values, it can determine when the current input is exhausted. It
then asserts 0 on its request wire and performs the binning and coincidence detection steps
of the method. When these steps have been completed the CP asserts logical 1 again on its
request wire, this time indicating its desire to send its results. When given permission to use
the bus, it sends its coincidence set to IO. IO is responsible for managing the location for
storage of this data. The output stream of the CP comprises a tally of the coincidences found
followed by the coincidences (csets) themselves. The coincidences are of the form:

hit count (no higher than Rmax)
size (that is, the width of the cset, i.e., the number of component attributes)
a size-long list of the attributes of the coincidence in form (value, position)

When all data has been sent to IO, the CP asserts 1 on its request wire to request more data.
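The layout of the CP output stream described above (a tally, then for each cset a hit count, a size and a size-long list of (value, position) attributes) might be sketched in Python as follows. A flat Python list stands in for the sequence of values placed on the data bus, and the function names are hypothetical.

```python
def encode_cp_output(csets):
    """Flatten csets into the stream: tally, then per cset the hit count,
    the size, and `size` (value, position) pairs."""
    stream = [len(csets)]
    for hit_count, attributes in csets:
        stream.append(hit_count)
        stream.append(len(attributes))
        for value, position in attributes:
            stream.extend((value, position))
    return stream


def decode_cp_output(stream):
    """Inverse of encode_cp_output."""
    it = iter(stream)
    tally = next(it)
    csets = []
    for _ in range(tally):
        hit_count = next(it)
        size = next(it)
        # Tuple elements are evaluated left to right: value, then position.
        attributes = [(next(it), next(it)) for _ in range(size)]
        csets.append((hit_count, attributes))
    return csets


# Two made-up coincidences: a pair seen 3 times and a singleton seen twice.
csets = [(3, [("G", 3), ("L", 5)]), (2, [("M", 87)])]
stream = encode_cp_output(csets)
```

Because the tally and each size are sent ahead of the data they describe, the receiver always knows how many further values to expect, in the same way that a CP uses R and N to tell when its input is exhausted.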

6. Merge Processor MG 

The merge processor may comprise:

a processor that runs the merging step
Nmax*Rmax local memory used to store the output from one CP
counters C1 and C2 (the former tracks the last output section read by MG;
    the latter counts the number of coincidences currently stored in the merge
    buffer)
memory used to store the current value of A
memory of size J*Nmax*Rmax*Amax used to store the merged results

Initially, MG sets its counters to zero and its request wire to zero and waits for IO to signal
it (by setting this wire to 1) that there is output data to be processed.

When MG sees that its request wire has been turned on, it knows to start receiving
output data indexed by the counter into its local memory. Once this has been accomplished,
MG can start the merging algorithm. The merge is done from the local memory directly into
the merge buffer (C2 must have the current number of coincidences when this step is
finished). When this step is completed, MG retrieves the current value of LastOut. If it is
greater than C1, then MG knows it can increment C1 and move directly on to the next
output section. If C1 and LastOut are equal, then MG sets its request wire to zero. If C1 has
reached A*J, then MG knows that all the results have been computed and merged (and thus,
that all CPs and IO are idle) and that it should set its bit flag to one (indicating that it is
finished) and start sending the contents of the merge buffer back to IO for transmission to
this node's parent. The results are sent simply as the value of C2 followed by the list of
coincidences stored in the merge buffer (the form of the coincidences is identical to that
described in section 5 above).
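The merge of one output section into the merge buffer might be sketched in Python as follows. The sketch assumes that merging means summing the hit counts of identical csets; the details of the merging steps are given in the pseudo-code referred to earlier and are not reproduced here, so this is an illustration rather than the actual procedure. The name `merge_section` is hypothetical.

```python
def merge_section(merge_buffer, section):
    """Merge one CP output section into the merge buffer.

    merge_buffer: dict mapping a cset key to its accumulated hit count.
    section: list of (hit_count, attributes) as produced by one CP.
    Returns the number of distinct coincidences now stored (counter C2).
    Assumption: identical csets are merged by adding their hit counts.
    """
    for hit_count, attributes in section:
        key = frozenset(attributes)
        merge_buffer[key] = merge_buffer.get(key, 0) + hit_count
    return len(merge_buffer)


merge_buffer = {}
c2 = merge_section(merge_buffer, [(3, [("G", 3), ("L", 5)])])
c2 = merge_section(merge_buffer,
                   [(2, [("L", 5), ("G", 3)]), (1, [("M", 87)])])
```

After both sections are merged the buffer holds two distinct coincidences, so C2 is 2, and the pair that appeared in both sections carries a combined hit count of 5.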
7. Input/Output Processor IO

IO contains: 

a bit vector of size J 

a counter, C1, indicating the next available output bin

a counter, C2, indicating the next unused R*N portion of input 

IO is intended to govern the execution of the algorithm as a whole as it is responsible for the
bus arbitration scheme outlined earlier. Initially, IO sets C1 and C2 to zero and zeroes its bit
vector (indicating that it has sent no data to any CP) and waits for the software driver to
start sending it data. During this time, it knows that no work can be done, and thus zeroes
all permissions for the bus. An interrupt signals the arrival of data from the driver and IO
continues to zero all communication requests until all the data has been written to the input
buffer. The incoming data is of the form:

N
R
T, the total number of row sets of size R sent
data stream of size T*R*N

IO can thus determine when no more data can be expected. Note that it is the responsibility
of the driver to:

divide data mining requests into sizes no greater than Amax
ensure that the number of rows sent as input is evenly divisible by R
ensure that Rmax and Nmax have not been exceeded by the current data set
merge all results sent back from the device

Once all input has been stored, IO sends out data of size R*N to each CPi by first setting the
ith bit in the vector to one (this indicates that IO should expect output from CPi), signaling
that CP by setting its response wire to 1 while zeroing all others, sending the data onto the
bus and finally incrementing C2.

When all CPs are busy (or all available input has been exhausted), IO waits for a CP
to assert 1 on its request wire, which indicates that it is ready to send back results. Once this
signal has been received from a CP, IO retrieves the results from the CP, stores them in the
output section indexed by the counter, zeroes the bit associated with that CP, increments C1
and asserts 1 on the MG request wire. If there is unused data in the input buffer, IO sends the
next available R*N set to the CP that just returned results (setting the bit for that CP to
one). When C2 equals T and the bit vector contains no bits set to 1, then IO knows that it is
finished and sets the IO bit flag to 1. At this point, IO goes back to the previously described
wait state until it sees the MG bit flag also set to 1 (indicating that MG has finished its
work). Once this occurs, IO calls an interrupt (if this node is the root of the tree) or just
requests to send (if this node has another node for a parent), gives MG permission to write
on the bus and then passes all data sent from MG to the parent.

Note that the proposed scheme allows for unequal execution time among the CPs:
the next CP to get data is the one most recently finished with its last allowance of data.
Thus, even though the overall operation of the system is clocked, there is a degree of
asynchronous processing ability.

The choices for particular processors, buses and other components are open to the
discretion of designers, fabricators, manufacturers, sellers, buyers and users, and the ranges
of options are known to those skilled in the art. In particular, all parts of the embodiment
described above may be obtained from "off-the-shelf" sources, or may be specially designed
at the VLSI level by persons skilled in the art.



Different Applications 
General 



Special-purpose embodiments are also possible. For example, in an application to
marketing and analysis of sales/transactions data, the objects input to the methods of the
present invention can correspond to transactions, and the attributes correspond to instances
of sale of particular products or services.

In an application to process management, industrial engineering or computer
systems management, the objects can correspond to particular time slices or time periods,
and the attributes correspond to the on/off or used/unused status of particular components,
resources, or subsystems. The goal of the application could be to find k-ary conflicts or
conflicting demands among interacting subsystems or users, in order to improve the
efficiency or lower the costs of the operations.

For example, the methods can be adapted to control a process for production of a
product as shown in the general flow diagram of Figure 8 and the schematic diagram of
Figure 9. This example can represent an automated sheet metal assembly plant. The
methods could be applied to an existing data set in order to discover correlations that indicate
that demand for one of the products from the plant will significantly decrease in the summer
months due to cyclical variations, while demand for another product increases. A link to
automated process control systems in the plant could reduce orders for the first product,
while increasing orders for the other. Many other examples will be evident to those skilled in
the art, including variations to the actual structure of the products as a result of discovered
correlations.

In an alternate embodiment, the discovered correlations may be used to generate
rules for a rules-based system that in turn produces products based upon those rules. A
general flow diagram for such an embodiment is set out in Figure 10. A corresponding
schematic diagram is set out in Figure 11.

In a further alternate embodiment, the rules-based system could be used to control a
process that creates products. A general flow diagram for such an embodiment is set out in
Figure 12. A corresponding schematic diagram is set out in Figure 13.

In application to financial analysis or trading, the objects can correspond to particular
time slices or time periods, and the variables can relate to particular prices, or price changes,
of particular financial instruments or commodities. By dividing the prices of each instrument
or commodity into a set of discrete levels, or by using a simple binary code for "increase vs.
decrease", one can represent each such instrument or commodity by a set of attributes, and
the invention can be employed to discover k-tuples of instruments or commodities whose
price movements are correlated. Those in the art know of many ways to gain value from
such discovered information.
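The binary "increase vs. decrease" encoding described above might be sketched in Python as follows. The function names, the instrument labels "X" and "Y" and the price figures are all made up for illustration, and treating an unchanged price as a decrease is an arbitrary choice of this sketch.

```python
def encode_moves(prices):
    """Encode a price series as one binary attribute per time step.
    Unchanged prices are treated as "down" here, an arbitrary choice."""
    return ["up" if later > earlier else "down"
            for earlier, later in zip(prices, prices[1:])]


def build_objects(series_by_instrument):
    """One object per time slice; attributes are (instrument, move) pairs."""
    encoded = {name: encode_moves(prices)
               for name, prices in series_by_instrument.items()}
    n_slices = min(len(moves) for moves in encoded.values())
    return [{(name, encoded[name][t]) for name in encoded}
            for t in range(n_slices)]


# Two made-up instruments whose prices happen to move together.
objects = build_objects({"X": [10, 11, 10, 12], "Y": [5, 6, 4, 7]})
```

Each resulting object is the attribute set for one time slice, which is the form of input the coincidence detection method operates on.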

In applications to medicine, epidemiology, or environmental science, the objects can
correspond to particular patients, or to different timed observations of a single patient, or
samples from the same or different environmental resource (such as air, soil, or water); the
variables and derived attributes would correspond to levels, or the presence/absence of
particular symptoms, drugs, toxins or contaminants. In this way, one can use the present
invention to discover interactions that may cause disease or environmental hazards.

In molecular and structural biological applications, the objects might correspond to
DNA, RNA, or protein sequences and/or structures. The attributes might correspond to the
presence of particular bases or amino acids at particular sequence positions, or to
substructures with particular geometric, chemical, physical, or biological properties at
particular sequence or structural positions, or to the presence or absence or levels of other
global or local properties. For example, set out further below is a detailed application of the
method to protein structure prediction, examples of which have previously been described.

In pharmacological applications, the objects might correspond to molecular structures
or other labels or representations of particular compounds or drugs, and the attributes might
correspond to the presence, absence, or levels of particular geometric, chemical, physical,
biological, toxicological, therapeutic and/or other properties and features, e.g., particular
chemical moieties. The present method would be used to find correlations among k-tuples
of such properties, and this information can be useful in the design and testing of compounds
and drugs, and in the design of combinatorial libraries for screening and testing, or for other
processes or steps in drug discovery and drug design. Alternatively, the above mapping can
be transposed, so that the objects correspond to the properties and features, and the
attributes correspond to the compounds and drugs. In this way, the present invention can be
used to find sets of drugs with similar or complementary or synergistic or antagonistic
activities. This, too, is extremely useful in drug discovery and drug design.

In applications to demographics, marketing, insurance and credit ratings, and/or
fundraising, the objects can correspond to particular people, or companies, or organizations.
The attributes could correspond to the presence or absence or levels of properties and
features relating to employment, income, wealth, credit history, lifestyle, consumption
patterns, or social/political opinions or affiliations. The present method could be used to
discover associations between such factors, which can be useful in such tasks as predicting
credit/insurance risks or detecting fraud; or in determining the best targets for allocating
limited marketing or fundraising resources, for example.

The problem of finding all significant correlations among pairs or k-tuples of
attributes in a database is ubiquitous in the computational sciences and in medical, industrial,
and financial applications. The principles described herein include a probabilistic algorithm
that has the interesting property of finding significant higher-order k-ary correlations, for all
k such that 2 ≤ k ≤ N in an N-attribute database, for the same computational cost as finding
just significant pairwise correlations. Moreover, k need not be fixed in advance in our
procedure, in contrast with other known procedures. The procedure was designed for the
task of finding conserved structural relationships in aligned protein sequences, but may have
more useful application in other domains.

Application of the Principles Described Herein to Protein Sequence Analysis

There are interactions between sequence-distant amino acid residues in the protein 
chain, sometimes detectable as correlations between positions (columns) in a set of aligned 
sequences from a protein structural family, that play an important role in determining 
structure and function. Discovered correlations may represent an evolutionary history of 
compensatory mutations, and may provide useful features in models of protein 
structural/functional families, but are ignored or mishandled by most ML (machine learning) 
classification methods, in part because of the high computational complexity of searching for 
k-tuples of correlated positions. 

In order to practice the invention on a matrix of biological sequences such as
nucleotide or amino acid sequences, the different sequences are first optimally aligned for the
purpose of comparison. A position in a first sequence is compared with a corresponding
position in a second sequence. When the compared positions are occupied by the same
nucleotide or amino acid, as the case may be, the two sequences are identical at that
position. The degree of identity between two sequences is often expressed as a percentage
representing the ratio of the number of matching (identical) positions in the two sequences
to the total number of positions compared. Optimally aligning two or more sequences
generally involves maximizing the degree of sequence identity between them.
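The degree of identity defined above might be computed for two already-aligned sequences as in the following Python sketch. The function name, the gap symbol "-" and the rule that gapped positions are not compared are assumptions of this sketch (conventions differ), and the two short sequences are made up for illustration.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of compared positions at which two aligned sequences match.

    Positions where either sequence has a gap are not compared
    (an assumption of this sketch; conventions differ).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        matches += (a == b)
    return 100.0 * matches / compared if compared else 0.0


# Two made-up ten-residue sequences differing at two positions.
identity = percent_identity("CTRPNNNTRK", "CTRPGNNTRR")
```

Here eight of the ten compared positions match, giving an identity of 80 percent.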

Several algorithms and computer programs are known to those of ordinary skill in
the art for aligning sequences. These tools include the PILEUP program from the Genetics
Computer Group (Madison, WI) package (version 8) using a modified version of the
progressive alignment method of Feng and Doolittle [J. Mol. Evol. 25, 351 (1987)];
CLUSTAL X, freeware available from the European Molecular Biology Laboratory
(EMBL), Heidelberg, Germany; and BLAST, freeware available from the National
Institutes of Health (NIH), Bethesda, MD. BLAST-P is used for amino acid sequences;
BLAST-N is used for nucleotide sequences and BLAST-X is used for nucleic acid
codon/amino acid translation.

Several kinds of useful information can be obtained from protein sequence family 
analysis. 

First, there is information to be extracted at the level of individual sequences, in the 
form of joint symbol frequencies. It is well-known that an abnormally high observed 
frequency of a particular single-position pattern (e.g., "G occurs at residue number 3 in 98% 
of these sequences") can reveal an important physico-chemical constraint on secondary or 
tertiary structure. This is also true of surprisingly-frequent joint symbol occurrences (e.g., 
"G at position 3, L at position 5, and M at position 87 occurs much more often than would 
be predicted by the individual marginal frequencies"). Such long-distance co-occurrences 
might be especially indicative of tertiary constraints, because the designated positions may be 
nearby each other in the 3D structure to which all of the modelled sequences correspond. 
(This detection of "suspicious coincidences", as when p(A,B) >> p(A)p(B), is at the heart of
pattern recognition and learning, as noted long ago by others.)

Second, there is information to be extracted at the "next level up", of statistical
relationships between the positions (columns in an alignment of homologous sequences). If
the existence of frequently occurring joint symbol k-tuples can be used to infer 3D structural
interactions, such an inference is even better supported by certain information-theoretic
relationships between positions (columns) over a set of many different joint symbol
occurrences. This is because such symbolic relationships can signify evolutionarily
conserved physical or structural relationships between different parts of the protein chain
(see Figure 15). The observation of high values of mutual information and other correlation
measures between columns has been used successfully to predict 3D structural interactions
in RNA and in HIV proteins; for example, see C.E. Shannon and W. Weaver, The
Mathematical Theory of Communication, The University of Illinois Press, 1964. While these
previously reported efforts have focused on pairwise residue-residue interactions, the
principles described herein aim at the detection of k-ary interactions for 2 ≤ k ≤ N.
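The pairwise quantities mentioned above, namely the ratio p(A,B) / (p(A)p(B)) and the mutual information between two alignment columns, might be estimated as in the following Python sketch. It uses plain observed frequencies with no small-sample correction, handles pairs of columns only (not the k-ary case addressed by the present method), and the example columns are made up.

```python
import math
from collections import Counter


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two alignment columns,
    estimated from observed symbol frequencies."""
    n = len(column_a)
    joint = Counter(zip(column_a, column_b))
    marg_a = Counter(column_a)
    marg_b = Counter(column_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Ratio p(a,b) / (p(a) p(b)); values well above 1 correspond to
        # the "suspicious coincidences" described in the text.
        ratio = p_ab / ((marg_a[a] / n) * (marg_b[b] / n))
        mi += p_ab * math.log2(ratio)
    return mi


col_i = list("GGGGAAAA")
col_j = list("LLLLMMMM")  # covaries perfectly with col_i
col_k = list("LMLMLMLM")  # statistically independent of col_i
mi_ij = mutual_information(col_i, col_j)
mi_ik = mutual_information(col_i, col_k)
```

The perfectly covarying pair of columns yields one bit of mutual information, while the independent pair yields zero, since every joint frequency there equals the product of its marginals.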

Discovered k-tuples of correlated amino acid residues can be used in protein
structure prediction and structure determination.

Local predictions can help narrow the search for the best global structure 
predictions. 

First, there are distance geometry constraints. Secondary structure prediction, and
the discovery of k-ary long-distance interactions, give evidence for presumed contacts, of
the form contact(i,j) for the ith and jth amino acid residues in a protein. Using the kind of
distance geometry theory developed by others (see for example, T.F. Havel, L.D. Kuntz,
G.M. Crippen, The Theory and Practice of Distance Geometry, Bull. of Mathematical Biology
v.45 1983 pp. 665-720, and K.A. Dill, K.M. Feibig, H.S. Chan, Cooperativity in
Protein-Folding Kinetics, Proc. Natl. Acad. Sci. U.S.A. v.90 March 1993 pp. 1942-1946), one can
derive a set of inferred contacts. One can also derive sets of inferred blocks, contacts that
are forbidden by a given set of presumed or inferred contacts. Essentially, given a model of
a polymer chain constrained to exist within a fixed volume, the assumption that two
particular pieces are brought into contact implies that some other pieces are also brought
into proximity and that still other pieces are moved further apart. Indeed, others have
concluded that "considerable amounts of internal architecture (helices and parallel and
anti-parallel sheets) are predicted to arise in compact polymers due simply to steric restrictions.
This appears to account for why there is so much internal organization in globular proteins."

Second, as discussed throughout the previous sections, one can infer and exploit
empirical relationships between local and global configurations. Local stretches of sequence,
or selected non-local pairs of residues, can be found to occur, with some high probability, in
particular global configurations. Heuristic rules, in whatever form, can be used to avoid
large parts of conformation space. The inference of particular models of cooperativity in
folding is a special case: knowledge of "rules" such as p(contact(i,j) | contact(i+1, j-1)) >
p(contact(i,j)) can help significantly.

For example, Figure 16 illustrates steps in tertiary structure prediction. The methods 
described throughout this application can be applied as part of a larger tertiary structure 
prediction system, wherein the principles described above are employed in the block related 
to the analysis of aligned sequence families. The system predicts the structure of a protein. 

Discovery of Evolutionarily-Conserved Structural Constraints

Three questions are addressed in this section:

1. What kinds of evolutionarily conserved multi-residue structural or functional
constraints might one expect to find by detecting correlations between
columns in a multiple sequence alignment?

2. Have correlation-detection efforts in fact found important structural or
functional constraints?

3. How much information do such discoveries provide towards predicting or
determining a molecule's native tertiary structure?

What Do We Expect to Observe?

A protein family is the set of amino acid sequences that are believed to share a
common global tertiary structure. The theory and observation of protein folding and
evolution support the general idea of evolution and conservation within a protein family:

• Functional constraints are conserved in surface residues;

• Structural constraints are conserved in core residues;

• Mutational drift dominates in loop residues.

Functional constraints often involve other molecules - such as other proteins, nucleic
acids, lipids, metals, O2 or other small molecules.

The kinds of structural constraints expected to be conserved throughout evolution of
a protein family are mainly those involving a few key residues that stabilize a conformation.
Where electrostatic interactions are deemed important, one might expect to find a
conservation of net charge across two or more sequence positions. When one of two
electrostatically interacting residues carries a positive charge, its "partner" residue
(presumably close in 3D structure even if distant in sequence) should be negatively charged,
and vice versa. The situation is similar for packing constraints. One might reasonably
expect sections of the protein core volume to vary only slightly across the many different
proteins in the same structural family, while non-core regions might display large volume
variability. Thus one might expect to find pairs or small k-tuples of residues that display
mutually compensatory mutations with respect to side-chain volume - when a "Large"
mutates to a "Small", another "Small" must mutate into a "Large", to put it simplistically.

What Has Been Observed?

Neher et al. (How frequent are correlated changes in families of protein sequences?
PNAS, 91:98-102, 1994) attempted to quantify the frequency of compensatory changes
within a single protein family by using physico-chemical property indices for amino acids and
then estimating Pearsonian correlations between columns in an alignment. They attempted
to get around the small-dataset problem with a bootstrap-inspired resampling scheme based
on the examination of pairs of sequences from the family. Their study of the myoglobin
family of protein sequences found the degree of compensatory mutation to be low for the
property of side-chain volume but high for electrical charge - close to the correlation level
expected for perfect conservation of local charge. The authors speculate that because their
column-pair analyses focused only on contact-neighbour pairs of residues, they were able to
detect a very locally-acting constraint like charge conservation but not a more distributed
constraint like conservation of volume. (In other words, a single positively-charged residue
must be in contact with its single negatively-charged structural partner, whereas a set of
compatible-volume partners may comprise more than two residues and need not all be in
contact.) Others have also found some evidence of coordinated mutation in the evolution of
protein structural families.
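
The column-pair correlation idea described above can be illustrated with a minimal sketch. It assumes a toy alignment and a hypothetical charge index (+1 for K and R, -1 for D and E, 0 otherwise); neither is taken from the cited study, and the bootstrap-inspired resampling over sequence pairs is omitted.

```python
import math

# Hypothetical side-chain charge index (an assumption for illustration only).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def column_values(alignment, col, index):
    """Map one alignment column (0-based) to numeric property values."""
    return [index.get(seq[col], 0.0) for seq in alignment]


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


# Toy alignment: columns 0 and 2 always carry opposite charges.
toy_alignment = ["KAD", "DAK", "RAE", "EAR"]
r_charge = pearson(column_values(toy_alignment, 0, CHARGE),
                   column_values(toy_alignment, 2, CHARGE))
```

In this toy case the two columns are perfectly anti-correlated in charge, which is the signature one would expect for perfect conservation of local net charge.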

While most studies, to date, of compensatory mutation focus on highly-conserved 
"core"-type regions of protein structures, Korber et al. (Covariation of mutations in the V3 
loop of HIV-1: An information-theoretic analysis. Proc. Nat. Acad. Sci, 90, 1993) analyzed 
the highly-variable V3 loop of the HIV-1 envelope protein. The researchers performed 
robust bootstrapped estimates of the pairwise mutual information for all column-pairs from a 
set of 31 columns, representing V3 residues. They found a set of about seven pairs that
showed considerable and statistically-significant mutual information, and their analysis of the 
particular attributes (amino acids) suggested a particular pattern of highly likely 
compensatory mutations. Although the authors did not argue or provide evidence for any 
particular properties or relationships being conserved, subsequent mutational analysis 
experiments in the laboratory indicated functional linkage between some of the pairs of sites 
with high mutual information. Because the V3 region is known to be both functionally and 
immunologically important, the inventor of the instant application suggested that such 
analyses might be important in the search for HIV/AIDS vaccine design.

What Kind of Method is Needed? 

Clearly, several well-studied and effective methodologies exist for the comprehensive
modelling of protein sequence families. In each case, the mathematical machinery is in place
to handle and detect very local and low-order statistical structure in the data. In each case,
the difficulties with computational complexity and statistical estimation arise in the attempt
to account comprehensively for all possible non-local and higher-order interactions between
residues, i.e., columns in the aligned sequence data.

Easier progress in modelling can be made if one is to use HMMs or density networks
in conjunction with a fast, heuristic preprocessor that focuses explicitly on the detection of
plausible non-local interactions while sacrificing a degree of precision in modelling these
interactions. Such a procedure is provided by the principles described herein.

a) HIV PROTEIN SEQUENCE ANALYSIS
Tests on an HIV Protein Database

The Los Alamos HIV database contains, among other things, the amino acid
sequences for the V3 loop region of the HIV envelope proteins. This region is known to
have functional and immunological significance, and the discovery of sets of sites linked by
evolutionary covariation might have important implications for understanding and preventing
HIV infection and replication.

An earlier and smaller version of the same database was used by Los Alamos
scientists in their analysis of pairwise mutual information between residues (columns).

Experiments were performed on an HIV dataset with the coincidence detection
procedure, over a set of different values for r and T. Tables of results are shown and
discussed below.

Results of Experiments on HIV Protein Database 

The aforementioned version of the HIV-V3 dataset was edited in order to focus on
the thirty-three residues considered most conserved and most structurally and functionally
important by the Los Alamos researchers. The dataset therefore consisted of M = 657 rows
(sequences) of N = 33 columns (residues). For the coincidence detection procedure, these
33 columns are transformed into N_A = N|A| = 33 · 21 = 693 attributes. As with the
artificial datasets, a set of experiments with different values of T and r was performed.
Coincidence detection runs were done with T = 10,000 and r = 5, 6, 7, 10 respectively, and
with T = 100,000 and r = 7, and finally with T = 750,000 and r = 7. The results are shown
in Tables C.1 through C.9 below.
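
The transformation of columns into attributes, and the role of the parameters T and r, can be sketched as follows. This is a simplified illustration and not the procedure of the invention: the 21-symbol alphabet (20 amino acids plus a gap symbol) is an assumption, and a "coincidence" is taken here to mean a pair of attributes present in every one of the r sampled rows, which is a stand-in rule chosen only to keep the example short. The names `to_attributes` and `count_pair_coincidences` are hypothetical.

```python
import itertools
import random
from collections import Counter

# Assumed alphabet: 20 amino acids plus a gap symbol, giving 21 symbols per column.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def to_attributes(sequence):
    """Encode one aligned sequence as a set of (symbol, column) attributes (1-based columns)."""
    return {(sym, col) for col, sym in enumerate(sequence, start=1)}


def count_pair_coincidences(rows, T, r, seed=0):
    """Sample r rows T times; count attribute pairs shared by all sampled rows.

    The 'shared by all r rows' rule is a simplifying assumption for illustration.
    """
    rng = random.Random(seed)
    attr_rows = [to_attributes(s) for s in rows]
    counts = Counter()
    for _ in range(T):
        sample = rng.sample(attr_rows, r)
        shared = set.intersection(*sample)
        ordered = sorted(shared, key=lambda a: a[1])  # order attributes by column
        for pair in itertools.combinations(ordered, 2):
            counts[pair] += 1
    return counts


# With N = 33 columns and 21 symbols, the attribute count matches the text.
n_attributes = 33 * len(ALPHABET)
```

Under these assumptions, a row such as "QD" yields the attributes Q@1 and D@2, and the observed counts returned by `count_pair_coincidences` play the role of the "Observed" column in the tables that follow.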

Table C.1: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 5.

HIV Dataset.
T = 10,000, r = 5.

Rank  CSET            Observed  Expected      Prob.

1     Q17|D24         1012      632.553864    0.316056
2     R17|T21         901       610.770465    0.509734
3     R12|Q17         570       348.605833    0.675621
4     L13|W19|Q24     195       5.535741      0.750381
5     N4|K9|A21       226       74.167398     0.831582
6     V11|R12|T18     159       20.764346     0.858239
7     R12|T18         454       318.517747    0.863429
8     L13|K31         419       300.333903    0.893461

Table C.2: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 6.

HIV Dataset.
T = 10,000, r = 6.

Rank  CSET            Observed  Expected      Prob.

1     Q17|D24         1177      385.853329    0.030891
2     R17|T21         957       368.736702    0.146238
3     H12|A18         1047      577.583832    0.294000
4     S10|D24         859       424.457490    0.350274
5     R12|Q17         656       224.743830    0.355855
6     R12|T18         628       283.191527    0.516585
7     R17|E24         563       234.477161    0.549033
8     H12|R17         760       434.274580    0.554644
9     A18|T21         560       315.973734    0.718330
10    I11|R17         861       627.014684    0.737741
11    L13|W19|Q24     230       5.365202      0.755529
12    A21|D24         619       405.487239    0.776262
13    N4|K9|A21       237       25.176801     0.779367
14    V11|R12|T18     220       15.841474     0.793296
15    L13|K31         462       267.211446    0.809942
16    G10|H12         324       157.554658    0.857348
17    M13|W15         245       84.760597     0.867059
18    Q17|K31         384       231.749746    0.879169
19    H12|R17|A18     147       8.219536      0.898526
20    N4|K9|H33       309       170.353419    0.898711

Table C.3: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 7.

HIV Dataset.
T = 10,000, r = 7.

Rank  CSET            Observed  Expected      Prob.

1     Q17|D24         1312      228.829775    0.008322
2     N4|K9           2023      996.505631    [illegible]
3     H12|A18         1175      328.263693    0.053591
4     R17|T21         940       216.431391    0.118015
5     Q31|H33         3198      2481.050915   [illegible]
6     R12|T18         879       244.789293    [illegible]
7     S10|D24         836       232.201517    0.225812
8     R12|Q17         720       140.866087    0.254370
9     I11|R17         808       360.719364    0.441944
10    H12|R17         659       253.717115    0.511491
11    R17|A18         720       361.819054    0.592356
12    A21|D24         554       236.085429    0.661974
13    R17|E24         452       138.843412    0.670137
14    L13|K31         537       231.137972    0.682602
15    L13|W19|Q24     292       5.055474      0.714573
16    A18|T21         442       165.231990    0.731502
17    A18|Q31|H33     480       209.122778    0.741198
18    M13|W15         355       88.975694     0.749122
19    N4|K9|H33       340       75.556215     0.751690
20    V11|R12         513       253.001684    0.758878

Table C.4: The most likely correlated attributes, as estimated by the coincidence detection
procedure, for the HIV dataset. These results were produced with parameter settings
T = 10,000 and r = 10.

HIV Dataset.
T = 10,000, r = 10.

Rank  CSET            Observed  Expected      Prob.

1     Q31|H33         3933      883.532458    0.000000
2     N4|K9           2898      251.248235    0.000001
3     S10|F19         2245      907.769718    0.027977
4     F19|G23         2660      1588.173503   0.100497
5     R12|T18         1155      142.229768    0.128554
6     K9|I11          1230      311.653160    0.185125
7     A18|H33         1720      990.576490    0.345032
8     K9|H33          1125      405.874883    0.355482
9     H12|A18         732       54.213558     0.399002
10    S10|G23         1492      856.152048    0.445479
11    N4|H33          1257      689.784961    0.525468
12    A18|Q31         1188      636.901303    0.544755
13    Q17|D24         571       42.938312     0.572525
14    V11|R12         670       143.659674    0.574607
15    I11|R17         562       61.788305     0.606274
16    N4|R17          992       498.586806    0.614520
17    R12|Q17         484       31.204991     0.663619
18    K31|Y33         578       130.131866    0.669535
19    R17|T21         479       39.372545     0.679400
20    S10|D24         451       34.199456     0.706491

Table C.5: The thirty most likely correlated attributes, as estimated by the coincidence
detection procedure, for the HIV dataset. These results were produced with parameter settings
T = 100,000 and r = 7.

HIV Dataset.
T = 100,000, r = 7.

Rank  CSET            Observed  Expected       Prob.

1     H12|A18         11686     3282.636926    0.000000
2     N4|K9           21853     9965.056308    0.000000
3     Q17|D24         11585     2288.297747    0.000000
4     Q31|H33         31715     24810.509148   0.000000
5     R17|T21         9355      2164.313906    0.000000
6     R12|Q17         7259      1408.660868    0.000001
7     R12|T18         8380      2447.892930    0.000001
8     S10|D24         7666      2322.015166    0.000009
9     I11|R17         8336      3607.193645    0.000109
10    A21|D24         6342      2360.854285    0.001550
11    H12|R17         6363      2537.171146    0.002543
12    R17|A18         7162      3618.190543    0.005941
13    R17|E24         4451      1388.434119    0.021747
14    A18|T21         4673      1652.319901    0.024130
15    V11|R12         5486      2530.016841    0.028256
16    L13|K31         5224      2311.379719    0.031348
17    N4|K9|H33       3519      755.562151     0.044291
18    A18|Q31|H33     4665      2091.227775    0.066951
19    L13|W19|Q24     2585      50.554739      0.072672
20    R17|Q31         5967      3574.032278    0.096592
21    M13|W15         3204      889.756945     0.112364
22    V11|R12|T18     2424      117.500168     0.114017
23    N4|A21          6209      4030.321314    0.144077
24    K31|Y33         4878      2773.817984    0.164117
25    Q17|K31         3440      1450.098718    0.198651
26    K9|A21          5614      3692.671816    0.221632
27    F19|D24         3998      2250.071839    0.287354
28    Q17|A21         4151      2414.536189    0.292077
29    G10|H12         2661      953.572593     0.304245
30    H12|E24         3018      1458.576938    0.370622

Table C.6: The first twenty-five of the fifty most likely correlated attributes, as estimated by
the coincidence detection procedure, for the HIV dataset. These results were produced with
parameter settings T = 750,000 and r = 7. Note the appearance, at this degree of sampling, of
several statistically significant higher-order features with k ≥ 3.

HIV Dataset.
T = 750,000, r = 7.

Rank  CSET            Observed  Expected        Prob.

0     A18|Q31|H33     36019     15684.208314    0.000000
1     A18|T21         33816     12392.399254    0.000000
2     A21|D24         45549     17706.407140    0.000000
3     H12|A18         86025     24619.776947    0.000000
4     H12|R17         48257     19028.783592    0.000000
5     I11|R17         64548     27053.952336    0.000000
6     L13|K31         39382     17335.347894    0.000000
7     L13|W19|Q24     20184     379.160544      0.000000
8     M13|W15         23300     6673.177086     0.000000
9     N4|K9           162152    74737.922307    0.000000
10    N4|K9|H33       26376     5666.716129     0.000000
11    Q17|D24         86891     17162.233105    0.000000
12    Q31|H33         233190    186078.818611   0.000000
13    R12|Q17         53740     10564.956512    0.000000
14    R12|T18         62774     18359.197022    0.000000
15    R17|A18         54366     27136.429076    0.000000
16    R17|E24         33748     10413.255892    0.000000
17    R17|Q31         45065     26805.242087    0.000000
18    R17|T21         70301     16232.354294    0.000000
19    S10|D24         57772     17415.113746    0.000000
20    V11|R12         39546     18975.126308    0.000000
21    V11|R12|T18     17628     881.251263      0.000000
22    K31|Y33         36346     20803.634880    0.000002
23    N4|A21          45441     30227.409858    0.000003
24    Q17|K31         25033     10875.740384    0.000018
25    G10|H12         20779     7151.794446     0.000041

Table C.7: Continuation of the fifty most likely correlated attributes, as estimated by the
coincidence detection procedure, for the HIV dataset: csets ranked 26 through 50. These
results were produced with parameter settings T = 750,000 and r = 7. Note the appearance,
at this degree of sampling, of several statistically significant higher-order features with k ≥ 3.

HIV Dataset.
T = 750,000, r = 7.

Rank  CSET                      Observed  Expected       Prob.

26    K9|A21                    40098     27695.038620   [illegible]
27    F19|D24                   29121     16875.538795   [illegible]
28    Q17|A21                   29621     18109.021417   [illegible]
29    H12|E24                   22348     10939.327035   [illegible]
30    N4|K9|I11                 15175     4159.316971    [illegible]
31    S4|T9|T12|V18|R21         10919     1.718549       [illegible]
32    N4|K9|A21                 11233     623.181959     0.002185
33    N4|Q31|H33                21868     11328.342993   0.002369
34    F19|A21                   44400     34516.144368   0.004910
35    K9|Q31|H33                16593     6991.723718    0.006625
36    W19|Q24                   16738     7234.038664    0.007331
37    E1|N12                    10844     1492.835945    0.008575
38    K9|E24                    13847     4587.312260    0.009408
39    K9|R17                    33735     24568.179150   0.010326
40    T12|V18                   23076     14893.617567   0.026158
41    R12|A21                   15497     7516.155896    0.031231
42    N4|K9|Q31|H33             8280      493.681367     0.036905
43    N4|K9|A18                 11655     4250.900600    0.050618
44    S4|T9|T12|V18|R21|Y33     7370      0.093039       0.052029
45    R12|Q17|T18               7452      240.364918     0.058992
46    V11|Q17                   14350     7329.962834    0.068429
47    R12|T21                   23263     16324.923094   0.072825
48    Q17|Y33                   17288     10374.788061   0.074203
49    L13|W19                   15536     8921.243955    0.092437
50    S11|H2[?]                 6529      138.997153     0.108375

Table C.8: The top thirty-five pairwise inter-column mutual information values for the HIV-V3
dataset, as estimated by our methodology as described in the main text.

Rank  Pair i,j  MI(c_i, c_j)  Std. Error

1     12|18     0.340449      0.037792
2     4|9       0.337943      0.0389162
3     9|21      0.319481      0.0353829
4     23|24     0.315202      0.0337213
5     12|24     0.314393      0.0330382
6     9|24      0.313992      0.0344732
7     19|24     0.305609      0.0335857
8     11|24     0.297498      0.0358645
9     24|26     0.290044      0.0384839
10    9|11      0.289911      0.0344244
11    9|23      0.285019      0.0343224
12    4|21      0.284936      0.0332236
13    18|21     0.278151      0.0404634
14    4|11      0.277189      0.0353993
15    12|21     0.273137      0.033385
16    4|24      0.262226      0.036189
17    21|24     0.260366      0.0338395
18    11|23     0.260337      0.0323302
19    11|19     0.249877      0.0320634
20    10|24     0.248938      0.0325318
21    19|23     0.242185      0.032301
22    5|26      0.239395      0.0386373
23    9|19      0.238318      0.0331283
24    4|23      0.23359       0.0302795
25    24|25     0.222109      0.0358744
26    6|26      0.220371      0.0397722
27    4|26      0.220213      0.0333324
28    6|24      0.218815      0.0335123
29    9|12      0.214844      0.0280984
30    15|24     0.213921      0.0301834
31    10|12     0.2133        0.0306496
32    9|18      0.21078       0.031734
33    11|21     0.210155      0.0308121
34    11|12     0.209421      0.0294066
35    4|19      0.20911       0.0290533

Table C.9: The top seven pairwise inter-column mutual information values for the HIV-V3
dataset, as estimated by the Los Alamos group.

Rank  Pair i,j

1     23|24
2     12|24
3     12|18
4     12|23
5     19|24
6     10|24
7     10|12

Tables C.1 through C.4 illustrate the most significant csets (again
measured by our procedure's estimation of P(Observed | Independence) for the Observed
number of coincidences for each detected coincidence of attributes). As one might expect, a
clean separation between "probably correlated" and "probably uncorrelated" does not
manifest itself at this comparatively low degree of sampling for this real-world dataset.
Results for r = 7 and r = 10 indicate more significant discovered csets than those for r = 5
and r = 6. At these former, higher r values, one sees the emergence of a few csets with
"Prob" values less than 0.1: (Q@17, D@24), (N@4, K@9), (H@12, A@18), (Q@31,
H@33) and (S@10, F@19). All of these csets appear among the most significant csets
reported in the more intensive sampling runs (with T = 100,000 and T = 750,000), with the
notable exception of (S@10, F@19). This latter cset is discovered at this low degree of
sampling only in the r = 10 run, and does not appear in the more intensive sampling runs
shown, both of which used r = 7.

Table C.5 displays the results for T = 100,000 and r = 7, and here it is clear that
some separation of signal from noise is taking place amongst the set of HOFs, with
seventeen pairwise and three 3-ary correlations appearing within our Prob < 0.1 significance
level.

At T = 750,000, we have more statistically significant detection of almost fifty 2-ary,
3-ary and up through 6-ary attribute correlations, as shown in Tables C.6 and C.7.

In order to get a better sense of the possible meanings of these results, let us consider
these inter-attribute correlations along with some inter-column correlations in the form of
pairwise mutual information estimates performed in our own analysis and also by the Los
Alamos group. Table C.8 displays the highest estimated mutual information values amongst
all N(N-1)/2 = 528 pairs of columns from our 33-column dataset. The estimates were obtained
using a bootstrap-like procedure in which 1000 sample data subsets of m = 300 out of M = 657
were drawn and run through the standard mutual information calculation. Reported in the
table are therefore the mean values over the resampling and the associated standard error
values. There is significant intersection between the set of column-pairs indicated by the top
cset values in Tables C.6 and C.7 and those indicated by the top mutual information values
in Table C.8. The correspondence between the two rankings is not perfect, for a few
reasons (besides noise and simple sampling error). First and foremost, while the
"suspiciousness" of a single joint-attribute combination certainly contributes to the mutual
information within the corresponding set of columns, the behaviour of the other symbols
appearing within the columns obviously also can have great effect. Second, we note again
the observed sensitivity of coincidence detection results to the choice of r.
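
The bootstrap-like mutual information estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions: subsets are drawn without replacement, the logarithm is natural (the text does not state the units), and the reported "standard error" is taken to be the standard deviation of the resampled estimates. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(col_a, col_b):
    """Plug-in mutual information (in nats) between two equal-length symbol columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log((c * n) / (pa[a] * pb[b]))
               for (a, b), c in pab.items())


def bootstrap_mi(rows, i, j, m, n_resamples, seed=0):
    """Mean and spread of MI between columns i and j (0-based) over random subsets of size m."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        subset = rng.sample(rows, m)  # assumption: m rows drawn without replacement
        estimates.append(mutual_information([s[i] for s in subset],
                                            [s[j] for s in subset]))
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
    return mean, math.sqrt(var)


# Number of column pairs for N = 33 columns, as quoted in the text.
n_pairs = 33 * 32 // 2
```

For the dataset described in the text, the call would take the form `bootstrap_mi(rows, i, j, m=300, n_resamples=1000)` for each of the 528 column pairs, with `rows` holding the 657 aligned sequences.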

Table C.9 lists the highest statistically significant mutual information values as
estimated by the Los Alamos group. We note the overlap between their list and ours, but
we emphasize again that group's use of an earlier, smaller, and perhaps otherwise different
database to which we did not have access.

Application of the coincidence detection method of the invention to biological data such as
these aligned HIV sequences thus leads to identification of covarying structural elements
that were previously unrecognized. The statistically significant coincidence of particular
structural elements, such as amino acid residues, likely indicates a biological role for a motif
comprising the covarying elements, as structure and function are tightly linked in
biochemical systems. One such example from the above application of the invention is the
statistically significant coincidence of residues A18, Q31 and H33 in the V3 loop of HIV
envelope protein. These residues are expected to contribute to a structural motif of the V3
loop that plays a biological role in the HIV life cycle. Such new information about
A18/Q31/H33, which prior to the invention have never before been grouped together for a
particular biological role, may be exploited in various ways, as follows.

A peptide or peptidomimetic mimicking the afore-mentioned structural motif of the 
V3 loop (or another protein motif identified by the coincidence detection method) is 
provided by the invention. For the chosen example, the peptide or peptidomimetic would 
include spatial coordinates of amino acid residues A18/Q31/H33, though every atom of these 
amino acids would not necessarily be required. Rather, the peptide or peptidomimetic 
would have such spatial coordinates of A18/Q31/H33, as well as topological and 
electrostatic attributes, that would make it useful for a biological function, such as, for
example, competing with the actual V3 loop of HIV for binding to another biological
molecule, where such binding of V3 would employ the structural motif that is mimicked by 
the peptide or peptidomimetic. 

Alternatively, a peptide or peptidomimetic which is designed based on covarying
k-tuples discovered by the coincidence detection method could be used as an antigen. That is,
the biological function which the molecule mimics is eliciting an immune response in an
animal. Similarly, vaccines embodying the covarying k-tuples described herein are also
encompassed by the invention.

Morgan and co-workers (Morgan et al. 1989. In Annual Reports in Medicinal
Chemistry. Ed.: Vinick, F.J. Academic Press, San Diego, CA, pp. 243-252.) define peptide
mimetics as "structures which serve as appropriate substitutes for peptides in interactions
with receptors and enzymes. The mimetic must possess not only affinity but also efficacy
and substrate function." For purposes of this disclosure, the terms "peptide mimetic" and
"peptidomimetic" are used interchangeably according to the above excerpted definition.
That is, a peptidomimetic exhibits function(s) of a particular peptide, without restriction of
structure. Peptidomimetics of the invention, e.g., analogues of the structural motif of the V3
loop posited above, may include amino acid residues or other chemical moieties which
provide the desired functional characteristics.

The invention further provides a ligand that interacts with a protein having a
structural motif identified using the coincidence detection method of the invention, as well as
a pharmaceutical composition including the ligand and a pharmaceutically acceptable carrier
or excipient therefor. The ligand would include chemical moieties of suitable identity and
spatially located relative to each other so that the moieties interact with corresponding
residues or portions of the motif. By interacting with the motif, the ligand could interfere
with function of that region of the protein including the motif.

Thus, the invention provides a pharmaceutical composition for interacting with an
envelope protein of human immunodeficiency virus (HIV), including a ligand having a
functional group that interacts with the structural motif of the V3 loop which has spatial
coordinates of residues A18/Q31/H33, and a pharmaceutically acceptable carrier or
excipient therefor. The ligand may have more than one functional group that interacts with
the motif, such as, for example, a first functional group capable of binding to and being
present in an effective position in the ligand to bind to residue 18, a second functional group
capable of binding to and being present in an effective position in the ligand to bind to
residue 31, and a third functional group capable of binding to and being present in an
effective position in said ligand to bind to residue 33.

The invention further provides a method of designing a ligand to interact with a
structural motif of a protein, such as, for example, envelope protein of human
immunodeficiency virus (HIV). For example, in the case where the motif is the potentially
interesting A18/Q31/H33 motif identified by the coincidence detection method discussed
above, the method of designing includes the steps of providing a template having spatial
coordinates of residues A18, Q31 and H33 in the V3 loop of HIV envelope protein, and
computationally evolving a chemical ligand using an effective algorithm with spatial
constraints, so that the evolved ligand includes at least one effective functional group that
binds to the motif. The template provided may further include topological and/or
electrostatic attributes, and the effective algorithm may include topological and/or electrostatic
constraints. Similar method steps would be employed for other proteins comprising a motif
identified by the coincidence detection method.

The invention further provides a method of identifying a ligand to bind with a 
structural motif of a protein. The structural motif is preferably identified by the coincidence 
detection method. For example, in the case where the motif is that identified by the 
coincidence detection method comprising residues A18, Q31 and H33 of HIV envelope 
protein discussed above, the method includes the steps of: providing a template having 
spatial coordinates of A18, Q31 and H33 in the V3 loop of HIV envelope protein, providing 
a data base containing structure and orientation of molecules, and screening the molecules in 
the data base to determine if they contain effective moieties spaced relative to each other so 
that the moieties interact with the motif. The data base may further contain topological
and/or electrostatic attributes of the molecules, and the screening step may further include
determining if the moieties are effective in such regard for interacting with the motif. For
example, a molecule described in the data base may have such physical/chemical attributes
that it includes a first moiety that interacts with residue 18, a second moiety that interacts
with residue 31 and a third moiety that interacts with residue 33. Similar method steps
would be employed for other proteins comprising a structural motif of interest. 

Where a ligand provided by the invention is included in a pharmaceutical 
composition, the pharmaceutical composition further includes a pharmaceutically acceptable 
carrier as is known to persons skilled in the art relating to pharmaceutical compositions. 
The term "pharmaceutically acceptable carrier" as used herein includes diluents such as saline 
and aqueous buffer solutions and vehicles of solid, liquid or gas phase, as well as carriers 
such as liposomes (Strejan et al., 1984, J. Neuroimmunol. 7:27), and dispersing agents such 
as glycerol, liquid polyethylene glycols, and the like. The pharmaceutical composition may 
include any of the solvents, dispersion media, coatings, stability enhancers, antibacterial and 
antifungal agents (for example, parabens, chlorobutanol, phenol, ascorbic acid, thimerosal), 
isotonic agents (for example, sodium chloride, sugars, polyalcohols such as mannitol) and 
absorption delaying agents (for example, aluminum monostearate and gelatin) which are 
known in the art. 

Alternatively, a ligand provided by the invention, such as a ligand which binds to a 
biological target, may be employed for diagnostic purposes. A diagnostic agent according to 
the invention may include a ligand that interacts with a protein having a structural motif 
identified using the coincidence detection method, and a detectable label linked to the ligand. 
The detectable label may be any detectable substance known in the art, such as, for example, 
a fluorescent substance or a radioactive substance. Alternatively, the label may be an 
enzyme (such as, for example, horseradish peroxidase or alkaline phosphatase) which 
catalyzes a reaction having a detectable (e.g., colored) product, or the label may be the 
substrate for such an enzyme. 

Application of the Principles Described to Drug Discovery Background: 

The multi-billion dollar pharmaceutical industry is based in large part on the design 
or discovery and refinement of small molecules ("ligands") that interact with larger 
molecules ("targets") and in some way repress, enhance, block, accelerate or otherwise 
modify the structure, function or activity of the target. It is the structure, function or 
activity of the target that is in some way implicated in some mechanism of disease. The 
target molecule is often an enzyme or protein receptor or nucleic acid or some combination 
thereof. There are a great number of possible ligands, and only relatively few of 
them are developed and marketed as therapeutic compounds that work with or against 
one or more targets and thus are effective against disease. 

It is therefore of great interest to biotechnology and pharmaceutical researchers to be 
able to consider a huge number of potentially useful compounds, but to avoid spending too 
many resources developing therapies based on compounds that may turn out not to be 
useful, safe, effective, and economically viable. The methods described herein can be used 
to enhance and accelerate the process of discovering good, effective compounds and of 
distinguishing the promising compounds from the unpromising or less promising compounds 
in a public or private collection of molecules or their computer database representations. 
They can be used effectively and contribute value in this application in many ways, by 
helping to understand and infer target structures and by finding ligands whose geometric, 
topological, electrostatic or other features make them likely candidates for effective 
interaction with the targets. 

Application of the Principles Described Herein to Databases of Molecules and 
their Features 

One way to represent a large number of molecular structures within a computer 
database (whether stored in main memory, on magnetic disk, tape, or other electronic or 
optical media) is in terms of "screens". Persons skilled in the art will recognize screens as 
binary attributes wherein a given screen, or attribute, represents the presence or absence of a 
particular substructure pattern, for example, a sulfate group. If a set of compounds is 
represented with screens, then a particular compound, which we will denote by C, can be 
represented by a string of 1s and 0s wherein the 1s stand for those pre-defined substructure 
patterns that C contains and the 0s stand for those of the pre-defined substructure patterns 
that C does not contain. 
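By way of illustration only, the screen representation just described can be sketched in 
Python as follows. The screen names and the helper function screen_string are hypothetical 
and are not part of any embodiment described herein; any fixed, ordered list of substructure 
patterns would serve equally well. 

```python
# Hypothetical, ordered list of screens; each names one substructure pattern.
SCREENS = ["sulfate group", "benzene ring", "carboxyl group", "amide bond"]

def screen_string(substructures, screens=SCREENS):
    """Return a string of 1s and 0s, one character per pre-defined screen."""
    return "".join("1" if s in substructures else "0" for s in screens)

# Compound C is described here by the set of patterns it contains.
compound_c = {"benzene ring", "amide bond"}
print(screen_string(compound_c))  # prints 0101
```

The position of each character is fixed by the order of the screen list, so the strings for 
different compounds can be compared column by column. 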

This scheme can be extended to the representation of the primary structure of a 
nucleic acid or protein in terms of attributes, as discussed elsewhere herein. The primary 
structure is also known as the "sequence", that is, a sequence of bases, or nucleotides, in 
DNA or RNA, and a sequence of amino acids, also called amino acid residues, in a protein. 
It is simple to represent a protein sequence, for example, as a sequence of symbols, each 
symbol being a letter of the alphabet corresponding to one of the twenty standard 
naturally-occurring amino acids. It is also simple to transform this representation by 
representing each residue, or position, in the sequence by a set of twenty binary attributes, 
if such a representation is desired. The attributes act like the screens described above. For 
example, if the first amino acid in protein P is an alanine, represented by A, it can also be 
represented by a value of "1" in the attribute that stands for the question, "Is the amino acid 
in position 1 an alanine?", and by values of "0" for the attributes representing "Is the amino 
acid in position 1 a cysteine?", "Is it a phenylalanine?", and so on. Figure 15 provides an 
illustration of amino acids and residue positions. 

It is also easy and sensible to represent other aspects or features of the compounds in 
terms of attributes. For example, a given compound C may be known to be active against a 
particular target T, in which case an attribute corresponding to the question "Active against 
T?" would have the value 1 for the object corresponding to compound C. For another 
example, a pharmaceutical company may have run a number of compounds through a set of 
"assays", or tests of biological or chemical activity. An assay might test for some aspect of 
effectiveness against a target, or for ability to cross the blood-brain barrier, or for toxicity, 
for example. Assay results can be represented in terms of discrete-valued, and even binary, 
attributes as well, via preprocessing routines known to persons skilled in the art. Other 
features of particular compounds can include literature citations (that is, references to papers 
or studies in which the compound was described, designed, discovered or analyzed), and 
ownership or patent status of the compound. 
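The representation of each sequence position by twenty binary attributes, described above, 
can be sketched as follows. This is a minimal illustration under stated assumptions: the 
function names are hypothetical, the column order (position-major, then amino acid in the 
order of AMINO_ACIDS) is an arbitrary choice, and only the twenty standard one-letter codes 
are handled. 

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard one-letter codes

def encode_sequence(seq):
    """Map each position to twenty binary attributes (exactly one is 1)."""
    bits = []
    for residue in seq:
        bits.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return bits

def attribute_name(index):
    """Return the question that the attribute in a given column stands for."""
    position, aa = divmod(index, len(AMINO_ACIDS))
    return f"Is the amino acid in position {position + 1} {AMINO_ACIDS[aa]}?"

row = encode_sequence("AC")
print(len(row))           # 40 attributes for a two-residue sequence
print(attribute_name(0))  # Is the amino acid in position 1 A?
```

Further attributes, such as a binary "Active against T?" column, could simply be appended 
to the same row. 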

Not only can small therapeutic compounds be represented in terms of screens and 
other attributes, but so can larger potentially therapeutic molecules such as DNA, RNA, 
peptides, proteins, carbohydrates and lipids. Target molecules can also be represented in 
this way. All that is required is a predefined (though possibly updated, changing, shrinking 
or growing) list of substructural patterns or other features deemed important by the 
researchers or users. For target structures, one might want to represent substructural 
patterns as well as their 1-dimensional linear structures ("sequence"), genetic linkage 
information, interactions with other proteins in disease pathways, literature citations, and so 
on. Sometimes a particular molecule might be listed as more than one object in a database, 
the different objects representing different conformations that the molecule can take. 

Clearly, this use of screens and other attributes in representing compound databases 
can also be represented in terms of the M by N data matrix we have used to describe the 
working of the invention. The M by N data matrix is illustrated below in Table 1. 

The rows in Table 1 correspond to a set of molecules, compounds, molecular 
structures or sequences, while the columns correspond to features that may include 
substructural patterns, assay results or other aspects of the molecules. The value in table 
cell[i, j] is one (1) if molecule i has feature j and is zero (0) otherwise. 

                Feature 1    Feature 2    ...    Feature N
Molecule 1          1            0        ...        1
Molecule 2          0            1        ...        0
...                ...          ...       ...       ...
Molecule M          0            0        ...        1

Table 1 



Steps involved in applying the methods described herein to the analysis of a molecular 
database include: 

1. Obtain molecular database that supports discrete attribute representation for the 1D, 
2D and/or 3D molecular structures of interest (or, obtain molecular database and use 
standard methods to produce such a representation); also use standard methods to 
transform sequence and other information about molecules of interest into attribute 
representations. 

2. Present this database, in whole or part, to an embodiment of the current invention 
such that each compound in the database corresponds to one or more of the M 
objects (rows) in the embodiment's data matrix and so that each screen-represented 
substructure pattern corresponds to an attribute (column) of the data matrix. The 
additional attributes representing activity, assay results, known targets against which 
the compound has been used, source or means of production or storage of the 
compound, ownership or patent status of the compound, and so on, plus the 
substructure pattern attributes together comprise the N attributes (columns) in the 
data matrix. 

3. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 

4. Direct the discovered correlated k-tuples of attributes to: 

• A graphical viewer, or 

• A rule-generator preprocessor for rule-based system, or 

• A report for users, researchers or managers, or a report-generation system, or 

• Another computer program that performs some kind of further analysis of the 
compounds, sequences, or structures represented in the database, or 

• Another computer program that performs some transformation or optimization 
on the database, or 

• Another computer program that directs humans and/or robots in drug screening 
experiments or in design, refinement or production of therapeutic compounds. 
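The flow of steps 2 through 4 can be illustrated with a deliberately simplified Python 
sketch. It is not the base method or any embodiment described herein: it considers only 
pairs of binary attributes, counts a coincidence whenever both attributes have the value 1 
in a sampled row, and uses the product of the single-attribute frequencies as the expectation 
under independence. The function names, the example matrix and the choice of r and T 
values are assumptions made for illustration. 

```python
import itertools
import random
from collections import Counter

def observed_pair_counts(matrix, r, T, seed=0):
    """Over T random samples of r rows, count how often each pair of
    attributes (columns) has the value 1 together in a sampled row."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(T):
        for row in rng.sample(matrix, r):
            present = [j for j, value in enumerate(row) if value == 1]
            counts.update(itertools.combinations(present, 2))
    return counts

def expected_pair_count(matrix, pair, r, T):
    """Count expected for a pair if its two attributes were independent."""
    m = len(matrix)
    p_first = sum(row[pair[0]] for row in matrix) / m
    p_second = sum(row[pair[1]] for row in matrix) / m
    return T * r * p_first * p_second

# Hypothetical 4 x 3 data matrix: attributes 0 and 1 always occur together.
matrix = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
observed = observed_pair_counts(matrix, r=2, T=1000)
for pair in sorted(observed):
    ratio = observed[pair] / expected_pair_count(matrix, pair, r=2, T=1000)
    print(pair, round(ratio, 2))
```

A ratio well above one suggests that the pair co-occurs more often than independence would 
predict; pairs exceeding a chosen threshold could then be directed to any of the destinations 
listed in step 4. 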

The output of the current invention, in this drug discovery application, can be useful 
in many possible ways. 

First, it can be used in setting up or optimizing a screen-based representation of 
molecules. For example, it is known in the art that a good screen-based representation 
should use a set of screens (attributes) that are mutually uncorrelated and roughly 
equiprobable. The method of the current invention would produce, when used as described 
above, sets of correlated screens; this information can be used to add, remove, or combine 
the features that the screens represent, in order to make the modified set of screens closer to 
the ideal of uncorrelated and equiprobable. 

Other useful and valuable aspects of the information produced by the method include 
the following. 

For example, it is not uncommon for a pharmaceutical company to have good "lead 
compounds" that work in in vivo or in vitro experiments even when the researchers do not 
know the target structure, the active site on the target structure, or even which of several 
proteins in the biological system is the target. If the methods described herein are used to 
discover correlations among substructural patterns and assay results, this information can aid 
in inferring a target structure and designing even more effective lead compounds, because it 
allows researchers to associate structure with desired activity. 

Another example is that of finding correlated amino acid residues in that part of a 
drug discovery database corresponding to an aligned set of DNA, RNA or protein 
sequences, as discussed later herein. In this case, some of the correlated k-tuples of residues 
(positions) may correspond to evolutionarily conserved structural and functional 
relationships. Therefore the principles described herein can in this way be used to help 
predict or solve the structure and function of important biological macromolecules, including 
pharmaceutical targets such as receptors and enzymes. 

Another example is to find correlations between structural, functional, disease 
pathway or other aspects of one target molecule, T1, and another target molecule, T2; or 
finding correlations between structural, functional or other aspects of a set of potential 
therapeutic compounds aimed at T1 and those of a set of potential therapeutic compounds 
aimed at T2. In either case, this correlation information is useful because it allows drug 
designers to apply knowledge, compounds and techniques effective against T1 to the effort 
against T2. 

Another rather different application of the principles described herein to drug 
discovery and medical science is obtained by considering the transpose of the data matrix 
described above. Instead of compounds as objects (rows) and features of the compounds as 
attributes (columns), consider what is possible when the compounds correspond to columns 
and their features correspond to rows. See Table 2 below. Use of the current invention in 
this scenario produces correlated k-tuples of compounds in feature-space. These produced 
k-tuples can embody several kinds of valuable information. For example, if the features in 
the rows represent mostly substructural patterns (screens), then the produced k-tuples 
correspond to clusters of compounds. Such clustering of compound databases is very useful 
in high-throughput screening (HTS), with both biological/chemical assays (in vivo or in 
vitro) and computational assays. In HTS, it is useful and economical to assay only one or a 
few members of each cluster of compounds initially; then, only in the cases where a "hit" 
occurs (that is, a compound "passes" the "test" in the assay of biological or chemical 
activity) do other members of the corresponding cluster get sent through the assay. 

Use of the method on the "transpose" of the molecular database shown earlier, in 
order to cluster the compounds in feature-space, is shown in Table 2. It is now the columns 
that correspond to a set of molecules, compounds, molecular structures or sequences, while 
the rows correspond to features that may include substructural patterns, assay results or 
other aspects of the molecules. There are M' rows and N' columns, where perhaps M'=N 
and N'=M, for the original M and N described above. The value in table cell[j, i] is one (1) if 
molecule i has feature j and is zero (0) otherwise. 





                Molecule 1    Molecule 2    ...    Molecule N'
Feature 1           1             0         ...         1
Feature 2           0             1         ...         0
...                ...           ...        ...        ...
Feature M'          0             0         ...         1

Table 2 
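The transposition and the cluster-based screening strategy described above can be sketched 
as follows. This is an illustration under stated assumptions: the clusters are taken as 
given (for example, correlated k-tuples of compounds already reported), the first member 
of each cluster is arbitrarily used as its representative, and the names transpose, 
screen_by_cluster and assay are hypothetical. 

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists data matrix."""
    return [list(column) for column in zip(*matrix)]

def screen_by_cluster(clusters, assay):
    """Assay one representative per cluster; test the rest only after a hit.

    clusters: list of lists of compound identifiers
    assay:    function returning True when a compound is a "hit"
    Returns (hits, number_of_assays_run).
    """
    hits, runs = [], 0
    for cluster in clusters:
        representative, others = cluster[0], cluster[1:]
        runs += 1
        if assay(representative):
            hits.append(representative)
            for compound in others:
                runs += 1
                if assay(compound):
                    hits.append(compound)
    return hits, runs

clusters = [["c1", "c2", "c3"], ["c4", "c5"]]
active = {"c1", "c3"}  # hypothetical assay outcome
hits, runs = screen_by_cluster(clusters, lambda c: c in active)
print(hits, runs)  # ['c1', 'c3'] 4
```

In this example four assays are run instead of five. The saving comes at a cost: an active 
compound is missed whenever the representative of its cluster is not itself a hit. 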



Application of the Principles Described Herein to Discover and Analyze 
Genetic Networks 

Advanced molecular biological and computational techniques applied in large-scale 
genome mapping and sequencing efforts are beginning to give us access to the sequences of 
complete genomes, the complete expression patterns of genes, and the ability to store and 
manipulate this information. Such information can be used to accelerate the discovery of 
new disease targets and successful therapeutic compounds. It is known that the genes that 
form the "blueprint" for particular physical traits and systems within an organism often act 
together in complex ways. Genes interact in mutually regulatory ways, promoting, 
repressing and otherwise modulating their own and each other's activation and expression. 

Traditionally, molecular biology has focused on the study of individual genes in 
isolation. However, to understand complex biological phenomena like neural development 
or oncogenesis, for example, it is necessary to study the expression patterns of tens or 
hundreds of genes in parallel, taking into account temporal patterns as well as anatomical 
patterns. Such analysis requires novel computational and statistical capabilities, such as 
those provided by the principles described herein. 

While many variations are possible and can be envisioned by those in the art, a basic 
scheme for employing the methods described herein in the analysis of genetic networks 
might include the following steps: 

Step 1: Select the genes of interest. 

Step 2: Select the biological parameters by which to represent the status of a gene at a 
particular time. Biological parameters can include: expression of a gene (concentration 
levels of the associated mRNA or protein product), a particular status of a protein such as a 
biologically relevant phosphorylation or any other post-translational modification, the 
location of a given protein, or the presence or absence of a cofactor. For example, one can 
use polymerase chain reaction (PCR) techniques to amplify, then use known methods to 
detect mRNA levels for each gene, then normalize these by dividing by maximum expression 
levels for each gene, and then quantize these continuously varying levels into a set of z 
discrete levels that can be represented in the data matrix format described throughout this 
document. It is also possible to use concentration levels of protein products as indicators of 
gene activity and interactivity. The change, over timed observations, of concentrations of 
proteins is governed mainly by three processes: direct regulation of protein synthesis from a 
given gene by the protein products of other genes (including auto-regulation as a special 
case); transport of molecules between cell nuclei; and decay of protein concentrations. 

Step 3: Select a scheme for time-sampling the biological parameters of the genes in the 
genetic system under analysis. At each appropriate time, use methods known in the art to 
measure the selected biological parameters for the selected genes. 

Step 4: Represent the selected genes in terms of the selected biological parameters, and 
represent the measured values of the biological parameters as attributes in the data matrix. 
Represent the time-samples (the instances of measurement of the biological parameters) as 
rows in the data matrix. That is, for a cell in the data matrix, in the ith row and jth column, 
enter the quantity or feature measured in the ith time-sample for the jth biological parameter 
(which may correspond to the jth gene, or it may not, depending upon whether one or more 
parameters are measured for each gene). The recorded quantity, level or feature may be 
binary (e.g., the gene is "on" or "off"), or may be one of z discrete values. As described 
elsewhere in this document, any discrete-valued attribute can be represented by a binary 
encoding of whether that value is absent or present in a given object, so that any of the 
preferred embodiments of the current invention can be applied to data of this type. 

Step 5: Employ the base method described above or one of the other embodiments 
described herein on the data matrix. 
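The normalization and quantization mentioned in Step 2, and the binary encoding mentioned 
in Step 4, can be sketched as follows. The sketch assumes non-negative expression levels 
and uses equal-width bins, which is only one possible quantization; the function names are 
hypothetical. 

```python
def quantize_expression(levels, z):
    """Normalize one gene's levels by its maximum, then bin into z levels.

    Returns integers in 0..z-1 using equal-width bins on [0, 1].
    """
    peak = max(levels)
    if peak <= 0:
        return [0 for _ in levels]
    normalized = [x / peak for x in levels]
    return [min(int(v * z), z - 1) for v in normalized]

def binary_attributes(quantized, z):
    """Encode each discrete level as z present/absent (1/0) attributes."""
    return [[1 if q == level else 0 for level in range(z)] for q in quantized]

levels = [0.0, 2.0, 5.0, 10.0]  # hypothetical mRNA levels at four time-samples
quantized = quantize_expression(levels, z=3)
print(quantized)                        # [0, 0, 1, 2]
print(binary_attributes(quantized, 3))  # one row of three 0/1 values per time-sample
```

Each inner list produced by binary_attributes contributes z columns to the row for the 
corresponding time-sample in the data matrix. 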

The output of the above steps, that is, a set of k-tuples of correlated attributes, can 
be interpreted as a set of cliques of correlated genes. For example, one might discover that 
one gene is "on" whenever another gene is "on". Or one might discover that when one gene 
G1 is in "low expression", another gene G2 is "off"; when G1 is in "medium expression", G2 
is in "low expression"; and when G1 is in "high expression", then G2 is in "medium 
expression". Such a result might lend support to the hypothesis that G1 promotes the 
expression of G2, or that "G1 turns G2 on". Similarly, correlated k-tuples of genes or 
biological parameters might provide evidence that one gene represses, or "turns off" another 
gene or set of genes, and so on. All such information can be useful in building a model, for 
example a "boolean network", of a set of interacting genes. Such models are known to 
those in the art as providing valuable assistance in diagnosing, preventing and curing disease 
and in designing effective and economically valuable therapeutics. 

The rows in Table 3 correspond to a set of time-samples (a.k.a., time points, time- 
slices), that is, times or periods of observance of the activity of a particular gene or gene 
product. The columns correspond to particular genes or gene products. The value in table 
cell[i, j] is one (1) if gene j is considered "on", that is, e.g., "active" or "expressed", during 
time i and is zero (0) otherwise. This representation and application is easily extended to 
situations in which the simple on/off status of a gene is replaced by a set of z distinct levels 
of expression, for example, as measured by observed quantities of a gene's main protein 
product. It is also easily extended to situations in which more than one biological parameter 
is used to represent the status of a single gene. 

              Gene 1    Gene 2    ...    Gene N
Time 1           1
Time 2           1                          0
...
Time M

Table 3 

The methods described herein have been applied to a set of gene expression data for 
genes involved in the development of spinal cord in rats, as described in (G.S. Michaels, 
D.B. Carr, M. Askenazi, S. Fuhrman, X. Wen, and R. Somogyi, Pacific Symposium on 
Biocomputing 3:42-53, 1998). The dataset is available from those authors and as of March, 
1998 is also available over the world-wide web (WWW) at 
http://rsb.info.nih.gov/mol-physiol/PNAS/GEMtable.html . 

Using a reverse-transcriptase polymerase chain reaction (RT-PCR) protocol, the 
expression of 112 genes (mRNA levels, normalized by maximal expression level) was 
assayed over nine developmental time points (E11, E13, E15, E18, E21, P0, P7, P14, and 
P90 or adult, wherein E=embryonic, and P=postnatal). Included in the list of genes used are 
genes considered important in CNS (Central Nervous System) development covering nine 
major gene families. 

The dataset mentioned above was easily transformed into a data matrix of objects 
and attributes, convenient for analysis with the methods described herein, in a few steps: 

1. The real-valued (that is, continuously-valued) gene expression levels were 
transformed into a set of discrete values by use of a Bayesian clustering method 
as embodied in the SNOB software, described in (C.S. Wallace and D.L. Dowe, 
"Intrinsic Classification by MML - the SNOB program", Proceedings of the 
Seventh Australian Joint Conference on Artificial Intelligence, pp.37-44, 1994). 
Bayesian methods of quantizing or discretizing real numbers are well known to 
persons skilled in the art. For convenience of interpreting output, these six 
discrete numerical values were then further transformed into a small set of 
alphabetic symbols, A through F. 

2. A data matrix was set up such that the columns of the matrix correspond to the 
112 different genes and such that the rows of the matrix correspond to the nine 
different developmental time points. 

The methods described herein were then run on the transformed gene dataset input, several 
times, each time using a different combination of values for the parameters r (sample size) 
and T (number of sampling iterations). The method can be applied to this dataset by use of 
a computer program very similar to the embodiment described in Appendices A and D; 
however, that particular embodiment was tailored for application to the protein sequence 
analysis domain, meaning that some of the parameter values were fixed to be appropriate for 
those particular trials on the HIV protein data. The program must be modified to allow for 
parameter values appropriate to the input data. 

These runs on the gene expression data were performed on an IBM PC-compatible 
computer under the Windows 95 operating system. For each run, a table of results was 
printed out for viewing and analysis. The results of one run, for T=100,000 and r=5, are 
attached as Appendix E. A researcher may wish to print out only the top 10, or 50, or 1000 
(or any other number) most highly correlated k-tuples of genes. In Appendix E, the top 25 
are shown. 

In the attached results printout, the following format convention was used: 
Each group of one or more lines reports one correlated k-tuple of genes, that is, one 
cset (coincidence set) which displayed a low probability of its individual component 
attributes being statistically independent, as described elsewhere in this document. 
Low probability of independence is a form of high correlation, as known to persons 
skilled in the art and as explained earlier in this document. For each k-tuple, the k 
genes are shown, followed by a numerical value for their probability of 
independence. (This number often displays as zero, because the calculated value is 
so small, so close to zero, that the decimal expansion is truncated to zero). Again, 
low probability value means high degree of correlation. For each gene, the symbol in 
A...F is shown, representing the quantized level of expression, followed by the 
internal dataset name for the gene, followed by the more standard accepted name for 
the gene. 
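The remark that a very small probability often displays as zero can be illustrated as 
follows. The output format of the program in the appendices is not reproduced here; the 
probability value and the format specifiers below are assumptions chosen only to show the 
effect and two common ways of avoiding it. 

```python
import math

p = 3.2e-15  # hypothetical probability of independence for one k-tuple

print(f"{p:.6f}")              # fixed-point, six decimals: 0.000000
print(f"{p:.3e}")              # scientific notation: 3.200e-15
print(f"{math.log10(p):.2f}")  # base-10 logarithm: -14.49
```

Scientific notation or a logarithm preserves the ordering of k-tuples whose probabilities 
would otherwise all print as zero. 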

The correlated k-tuples produced can be compared to the results reported by the 
authors in the aforementioned scientific paper. Among the analysis methods employed by 
those authors on this gene expression dataset was a pairwise mutual information analysis. In 
such analysis, a particular correlation measure, known as mutual information, was measured 
for each pair of the 112 genes, and the results were displayed graphically so that groups of 
genes with mutually high mutual information tend to appear close to each other. The 
method described herein is able, as shown by the results in Appendix E, to discover not only 
highly-correlated pairs of genes, but also 3-tuples, 4-tuples, and so on. Examination of the 
results in Appendix E and the results of the authors of the previously cited scientific paper 
shows that the two different methods tend to corroborate each other but that the current 
method goes farther in finding correlations among large numbers of attributes. For example, 
an examination of any line of output of our results reveals a set of correlated genes such that 
the different pairs of genes in that set are usually also listed as having high pairwise mutual 
information by the other authors' method. 
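For reference, the pairwise mutual information measure mentioned above can be computed 
for two discretized genes as sketched below. This is the standard textbook definition, 
estimated from observed frequencies; it is not taken from the cited authors' software, and 
the example symbol sequences are hypothetical. 

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length symbol lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi

gene_a = list("AABBCC")  # hypothetical quantized levels over six time points
gene_b = list("DDEEFF")  # fully determined by gene_a
gene_c = list("ABABAB")  # unrelated to gene_a in this sample
print(mutual_information(gene_a, gene_b))  # log2(3), about 1.585 bits
print(mutual_information(gene_a, gene_c))  # 0.0
```

With only nine time points per gene, as in the dataset discussed here, such frequency 
estimates are necessarily coarse. 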

It is not always true that a correlated k-tuple of attributes implies that all possible pairs, 
from that k-tuple, are also mutually correlated, nor vice versa. Therefore, a method like 
those described herein, that can find pairwise and higher-order k-ary correlations, offers 
advantages over pairwise methods which can fail to detect important higher-order 
correlations among genes or among other attributes in other applications. 
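One direction of this statement can be illustrated with a small constructed example (not 
drawn from the gene data): three binary attributes over four equally likely objects, where 
the third attribute is the exclusive-or of the first two. Every pair appears independent, 
yet the three attributes together are strongly dependent. The helper names are hypothetical. 

```python
from itertools import combinations

# Four equally likely objects; the third attribute is the XOR of the first two.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def joint_frequency(rows, columns, values):
    """Fraction of rows whose given columns hold the given values."""
    hits = sum(all(row[c] == v for c, v in zip(columns, values)) for row in rows)
    return hits / len(rows)

def independence_product(rows, columns, values):
    """Product of the single-attribute frequencies (the independence estimate)."""
    product = 1.0
    for c, v in zip(columns, values):
        product *= joint_frequency(rows, (c,), (v,))
    return product

for pair in combinations(range(3), 2):
    # Each pair: observed 0.25, expected under independence 0.25
    print(pair, joint_frequency(rows, pair, (1, 1)),
          independence_product(rows, pair, (1, 1)))

triple, values = (0, 1, 2), (1, 1, 0)
# The triple: observed 0.25, expected under independence 0.125
print(joint_frequency(rows, triple, values),
      independence_product(rows, triple, values))
```

A purely pairwise analysis of these rows would report nothing, whereas an analysis of 
3-tuples would detect the dependence. 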






Application of the Principles Described Herein to the Discovery of Categories 
in Internet/Intranet Document Databases for Use in Document Search Engines 

Document search by topic or keyword implies the existence of an efficient search 
engine and, indeed, much effort has been applied to the development of effective search 
algorithms. This, however, represents only a part of the total solution; the problem also 
requires an effective document categorization strategy. Information theory dictates that an 
effective set of categories, or topics, used to organize documents should be uncorrelated and 
roughly equiprobable. When these topics occur with widely-varying probabilities, the search 
space of documents will be either too broadly or too narrowly divided by some topics. If 
correlations exist between the topics (that is, where knowledge of the existence of a topic 
within a given document implies a greater probability that other topics will be found within 
the document as well) then the topic set can be reduced in size (by removing some of the 
correlated topics from the categorization set). The "equiprobability" concern can be 
addressed by the application of the principles described herein. This problem yields readily to 
statistical techniques, but standard statistical techniques usually fail to capture higher-order 
joint probability terms. The "decorrelation" problem is much more subtle and intractable. A 
sub-optimal topic set forces the search engine to examine more such topics than necessary 
before the results can be returned to the users (and may confuse interpretation of the 
organization of the documents themselves). Given that every increment in search efficiency 
allows greater numbers of users to use the system, the developers of such systems cannot 
afford a lack of effective categorization of documents. 
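The "roughly equiprobable" criterion can be checked directly from a document-by-topic 
matrix of the kind described in this section. The following sketch is illustrative only: the 
function names, the example matrix and the tolerance value are assumptions, and comparing 
each topic's frequency with the mean frequency is just one simple way to flag imbalance. 

```python
def topic_frequencies(matrix):
    """Fraction of documents (rows) that mention each topic (column)."""
    m = len(matrix)
    return [sum(column) / m for column in zip(*matrix)]

def unbalanced_topics(matrix, tolerance=0.25):
    """Indices of topics whose frequency is far from the mean frequency."""
    freqs = topic_frequencies(matrix)
    mean = sum(freqs) / len(freqs)
    return [j for j, f in enumerate(freqs) if abs(f - mean) > tolerance]

# Hypothetical matrix: four documents (rows) by three topics (columns).
docs = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
print(topic_frequencies(docs))  # [1.0, 0.5, 0.25]
print(unbalanced_topics(docs))  # [0, 2]
```

Here the first topic is mentioned by every document and so divides the search space too 
broadly, while the third is mentioned by only one document in four. 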

Application of the method to optimal or near-optimal topic set reduction can also be 
represented in terms of the M by N data matrix we have used to describe the working of the 
invention in other sections of this document. In one application-specific embodiment, the 
rows of the data matrix correspond to particular documents in the database; and the columns 
correspond to a proposed topic set that is intended to categorize them. (See Table 6). 

The rows in Table 6 correspond to documents in a database, while the columns 
correspond to proposed topics used to classify them. The value in table cell[i, j] is one (1) if 
document i mentions topic j and is zero (0) otherwise. 

                Topic 1    Topic 2    ...    Topic N
Document 1         1          0       ...       1
Document 2         0          1       ...       0
...               ...        ...      ...      ...
Document M         0          0       ...       1

Table 6 

Steps involved in applying the current invention to a search for a near-optimal topic set 
with which to classify a set of documents include: 

1. Obtain an initial topic set. The field of document search is well established and 
effective methodologies for the creation of such sets are known to those skilled in 
the art. 

2. Create the database using this topic set and the set of documents that the topic set 
categorizes. Given the topic set, all one need do is examine each document to 
determine whether or not it mentions each topic. 

3. Present this database, in whole or part, such that each document in the database 
corresponds to one or more of the M objects (rows) in the embodiment's data matrix 
and so that each proposed topic corresponds to an attribute (column) of the data 
matrix. 

4. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 

5. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A rule-generator preprocessor for a rule-based system, or

• A report for administrators or other users of the computer database query
system, or a report-generation system, or

• Another computer program that performs some kind of further analysis of the
data, for example, performing more in-depth statistical analysis (e.g., multiple
regression) on the correlated variables, or

• Another computer program that performs some transformation or
optimization on the database.
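
Step 2 above (examining each document to determine whether or not it mentions each topic) can be sketched in Python as follows. This is only an illustrative sketch: the function name `build_topic_matrix`, the keyword-matching test for "mentions", and the toy documents and topics are all assumptions made for this example and are not part of the described method.

```python
from typing import Dict, List


def build_topic_matrix(documents: List[str], topics: Dict[str, List[str]]) -> List[List[int]]:
    """Return an M x N 0/1 matrix: cell[i][j] is 1 if document i mentions topic j."""
    topic_names = list(topics)  # column order follows the dict's insertion order
    matrix = []
    for text in documents:
        lowered = text.lower()
        # Assumption: a document "mentions" a topic if any of its keywords appears.
        row = [
            1 if any(keyword in lowered for keyword in topics[name]) else 0
            for name in topic_names
        ]
        matrix.append(row)
    return matrix


# Hypothetical toy data, for illustration only.
documents = [
    "Wine tasting tours in sunny California",
    "Used cars for sale",
    "California wine and cars",
]
topics = {"wine": ["wine"], "cars": ["cars"], "california": ["california"]}
matrix = build_topic_matrix(documents, topics)
```

Each row of `matrix` then plays the role of one object (document) and each column the role of one attribute (topic), as in Table 6.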

Any statistically significant correlation between topics in the topic set may indicate
an ineffective initial choice of topics. The correlated k-tuples discovered by the method of
the current invention correspond both to "highly correlated topics" (with respect to the
"decorrelated topics" goal) and to "highly probable joint topics" (with respect to the
"roughly equiprobable topics" goal). A person skilled in the art can use the correlations
output in this application as a guide to determining which co-occurring topic(s) should be
removed from the topic set or combined within it. Using the output of the application in this
way would allow the administrator of such a document search engine to increase the performance
of the system by reducing the number of categories to be searched in response to a user's
query. The enhanced performance of the system would benefit the provider of the service in
two ways: the response time of the system to users' queries would decrease and the total
number of users that can be served would increase.
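
As a rough illustration of how correlated topics might be flagged, the following Python sketch compares the observed co-occurrence count of each pair of topics with the count that would be expected if the two topics were independent. It is a simplified, pairwise stand-in written for this example, not the base method itself; the name `pair_ratios`, the toy matrix and the threshold of 1.5 are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pair_ratios(matrix: List[List[int]]) -> Dict[Tuple[int, int], float]:
    """Ratio of observed to expected co-occurrence count for each pair of columns."""
    m = len(matrix)
    n = len(matrix[0])
    col_counts = [sum(row[j] for row in matrix) for j in range(n)]
    ratios = {}
    for a, b in combinations(range(n), 2):
        observed = sum(1 for row in matrix if row[a] and row[b])
        # Expected count if columns a and b were statistically independent.
        expected = col_counts[a] * col_counts[b] / m
        if expected > 0:
            ratios[(a, b)] = observed / expected
    return ratios


# Toy document-by-topic matrix: topics 0 and 1 always appear together.
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
ratios = pair_ratios(toy)
flagged = [pair for pair, ratio in ratios.items() if ratio >= 1.5]  # threshold is an assumption
```

Here topics 0 and 1 co-occur twice as often as independence would predict, so the pair would be a candidate for combination or removal.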

Applications of the Principles Described Herein to Internet and Intranet 
Search and Storage 

Internet and intranet search engines can be ranked subjectively by examining the 
length of time needed for users to find sites or documents of relevance to their query. Any 
improvement to the underlying algorithms that drive the search engine's output that allows 
users to find what they're looking for sooner improves the usefulness of that engine, allows it 
to serve more users and makes it more attractive to both the communities of users and 
advertisers (in the case of internet search) and users and management (in the case of 
company intranet search). Presented below are two uses of the principles described herein 
that will provide ways to get relevant information to users sooner and to better manage the 
storage of documents on internet or intranet search systems. In the descriptions and 
examples below, the principles discussed apply equally whether one is considering the 
internet/web and hence individual web pages and websites, or intranets, maintained within 
the information systems of a single company or other institution, in which case the search is 
for documents rather than websites per se. 

For the purposes of elucidating this description, assume that each page in the set of
web pages, or internal intranet documents in the set of such documents, known to the search
engine has already been classified by topic and that the set of topics is fixed a priori. The
goal is to present the user with the normal output of the search engine but to supplement
that list of links with an additional list of topics known to be related to the user's request.

The rows in Table 7 correspond to a set of web pages, or internal intranet
documents, while the columns correspond to topics. The value in table cell[i, j] is one (1) if
web page or document i mentions topic j and is zero (0) otherwise.

          Topic 1   Topic 2   ...   Topic N
Page 1       1         0      ...      1
Page 2       0         1      ...      0
...         ...       ...     ...     ...
Page M       0         0      ...      1

Table 7



Table 7 illustrates the database upon which the base method or other embodiment
described herein will be run, in the data matrix format for representing objects and attributes
that have been defined and described elsewhere herein. Note that, because of the
characteristics of the embodiments described herein, the number of pages used in the table
need not be the entire set of all web pages. The embodiment, when run (or employed) on
this table, will find those topics that are frequently found in the same document together.
This indicates that these topics are related in some fashion and, as the set of web pages
supports their association, they may be of interest to the user as well.
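
The idea of supplementing search results with related topics, computed from only a subset of the pages, might be sketched as follows. This is a hedged illustration only: `related_topics`, the observed-to-expected ratio test, the `min_ratio` threshold and the fixed random seed are assumptions of this example rather than details taken from the embodiments.

```python
import random
from typing import List


def related_topics(
    matrix: List[List[int]],
    query_topic: int,
    sample_size: int,
    min_ratio: float = 1.5,
    seed: int = 0,
) -> List[int]:
    """Return topic columns that co-occur with query_topic more often than expected."""
    rng = random.Random(seed)
    # The table need not contain every page: work on a random subset of rows.
    rows = rng.sample(matrix, min(sample_size, len(matrix)))
    m = len(rows)
    n = len(rows[0])
    query_count = sum(row[query_topic] for row in rows)
    related = []
    for j in range(n):
        if j == query_topic:
            continue
        other_count = sum(row[j] for row in rows)
        both = sum(1 for row in rows if row[query_topic] and row[j])
        expected = query_count * other_count / m  # independence assumption
        if expected > 0 and both / expected >= min_ratio:
            related.append(j)
    return related


# Toy page-by-topic table: topics 0 and 1 always appear on the same pages.
pages = [[1, 1, 0] for _ in range(5)] + [[0, 0, 1] for _ in range(5)]
suggestions = related_topics(pages, query_topic=0, sample_size=10)
```

With a smaller `sample_size` the same function runs on a more manageable subset, at the cost of noisier counts.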

The advantages are several. The computational expense of these embodiments scales
linearly with respect to the number of columns in the database. In this application, the
number of columns represents the number of topics associated with web pages. As this
number is almost certainly very large, this characteristic of the method is a real benefit. In
addition, if the web pages are kept in random order, the embodiments can be run on more
manageable subsets of the entire set of web pages. This allows the job of finding these
associations to be divided into much smaller jobs which can be run, serially or in parallel,
during idle times on the server where the search engine resides. This method can produce
novel associations of great width (k) at any point during its execution. Many other
"association mining" methods only find longer k-tuples of associated attributes at later stages
in their long execution times. Lastly, as the list of associated topics found by this algorithm
grows, the pages that select the links for these new "joint topics" can be created and cached.
This would reduce server loads (thus allowing more users to access the system). As the method
also puts bounds on the statistical relevance of the findings, this information could be used to
select which new topic indices would be cached and which would be re-created as needed.

Alternative Application of the Principles Described Herein to Manage the 
Storage and Retrieval of Web Pages and Documents: 

Internet and intranet search engines attempt to order the space of web pages or 
documents by topic. Generally, an initial (e.g. alphabetic) ordering is not at all likely to 
evenly divide that space. For example, the topic "California" will have a vastly greater set of 
pages associated with it than will "North Dakota". A simple tree-like storage of the pages 
by topic (with sub-topics at lower levels of the tree) will leave "California" with a very deep 
tree. What would be of use in this situation would be some better way to divide the search 
space of pages than by just single topics. In the noted example, it would be better to have 
the large set of California-related web pages divided into smaller sets closer to the size of the 
set for North Dakota. We can keep our ordering of the pages by topic if we choose to 
divide larger sets into smaller ones by replacing the single topic describing the set with a 
series of associated topic lists that encompasses the same space. Going back to our 
example, if "California" were strongly associated only with "Sunshine", "Wine" and "Cars",
we would replace the tree node "California" with the set of nodes "California and Sunshine",
"California and Wine", "California and Cars", "California and Other". This will allow faster
lookup and storage of these pages because it reduces the height of this part of the tree (in 
this case) by one. Recursively applying the same technique at all nodes in the tree would 
provide a method for ensuring better balance than could have been had before. The only 
thing missing from this formulation of the new tree balancing function is the discovery of the 
associations themselves. An application of embodiments described herein to the same table 
discussed in the previous section extracts this information from the set of pages. The
method tells us not only which topics are related but also gives an indication of the level of
support for each association in the database. Once a problematically large topic has been
identified, the list of associations found by the algorithm that includes this topic can be
consulted to determine how to divide the topic.
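
The replacement of a single large topic node by a set of "joint topic" nodes can be sketched as below. The function `split_topic` and the toy page set are hypothetical. The sketch also assumes that a page matching several associated topics is filed under each of them, which the description above does not specify.

```python
from typing import Dict, List, Set


def split_topic(
    pages: Dict[str, Set[str]], topic: str, associated: List[str]
) -> Dict[str, List[str]]:
    """Partition the pages tagged with `topic` into 'topic and X' nodes plus 'topic and Other'."""
    other_key = f"{topic} and Other"
    nodes: Dict[str, List[str]] = {f"{topic} and {name}": [] for name in associated}
    nodes[other_key] = []
    for page_id, page_topics in pages.items():
        if topic not in page_topics:
            continue  # page belongs elsewhere in the tree
        matched = [name for name in associated if name in page_topics]
        if not matched:
            nodes[other_key].append(page_id)
        for name in matched:
            nodes[f"{topic} and {name}"].append(page_id)
    return nodes


# Hypothetical pages and their topic sets.
pages = {
    "p1": {"California", "Wine"},
    "p2": {"California", "Cars", "Wine"},
    "p3": {"California"},
    "p4": {"North Dakota"},
}
nodes = split_topic(pages, "California", ["Sunshine", "Wine", "Cars"])
```

Applying the same function recursively to any node that is still too large would correspond to the recursive balancing described above.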

The use of tree-based storage and retrieval techniques is known to those in the art, and
these methods include variations such as B-trees, k-D trees, tries, k-D tries, and gridfiles.
Hashing schemes can also be used instead of, or in addition to, tree-based methods per se. 
With all such methods, there are efficiency gains to be made, in both storage (main memory 
and offline memory) and running time, by taking advantage of particular distributions of the 
data in the application domain. The embodiments described herein can, as shown above and 
in other ways, be used to obtain a better understanding of and exploitation of the distribution 
of the data. 

The advantages include all those listed for the first alternative above with one 
significant addition - if one is already using the method to find lists of sites related to a given 
query, then one is already compiling the exact list of associations that is needed here to help 
balance the search tree. 

Application of the Principles Described Herein to Sales Analysis, Direct Mail 
and Related Marketing Activities 

Marketing executives, within retail sales companies, advertising/marketing agencies, 
magazine, newspaper, radio, television, film and internet companies, and non-profit and 
charitable organizations, need to know which kinds of people are likely to buy or contribute. 
In all these and other marketing contexts, it is very useful and valuable to be able to analyze 
data both from previous marketing campaigns (we'll use the term "mailings", though other 
campaigns and promotions are also included) and from previous purchases of the relevant
goods and services, or previous contributions to charities (let us refer to all these as
"products").

It is useful for marketing executives, salespeople and management to know such
things as, for example:

• Which products tend to be bought together (by the same customer, perhaps within the same
transaction)?

• Which of our previous advertising campaigns or mailings produced good response
(high sales of a product) and which did not?

• Which demographic factors correlated with large total spending on our company's
products last year? Are 25-40 year old females in the Midwest region buying our
products?

Such questions can be addressed by the analysis of databases organized in terms of 
customers, transactions, demographic factors, previous marketing campaigns, and sales of 
particular products. For charitable organizations, the basic idea is the same, though instead 
of "sales" and "customers" the application is to "contributions" and "donors", for example. 
The principles described herein can be applied successfully to these analysis tasks, wherein
one of the main current computational challenges is the discovery of associations
(correlations) amongst sets of variables or attributes in very large databases. Table 8
illustrates the application to the analysis of databases on customer purchases of products.
Table 9 is similar except that it illustrates the case wherein not only purchases are recorded 
in the data, but also information on previous marketing campaigns. Either of these schemes 
may be augmented by the inclusion of additional columns corresponding to demographic 
attributes of the customers, for example region of residence, age group, income group, 
gender, occupational category, and participation in community- or leisure-related activities. 

The rows in Table 8 correspond to customers (and/or potential customers), while the
columns correspond to products (goods or services) that were either purchased (denoted by
1) or not purchased (denoted by 0) by particular customers. The value in table cell[i, j] is
one (1) if customer i has purchased product j and is zero (0) otherwise.

              Product 1   Product 2   ...   Product N
Customer 1        1           0       ...       1
Customer 2        0           1       ...       0
...              ...         ...      ...      ...
Customer M        0           0       ...       1

Table 8



The rows in Table 9 correspond to customers (and/or potential customers), while the
columns correspond to mailings (or other marketing campaigns) and products (goods or
services) that were either purchased (denoted by 1) or not purchased (denoted by 0) by
particular customers. For the Mailing columns, the value in table cell[i, j] is one (1) if
customer i was sent mailing j and is zero (0) otherwise. For the product columns, the value
in table cell[i, j] is one (1) if customer i has purchased product j and is zero (0) otherwise.

              Mailing 1   ...   Mailing n1   Product 1   ...   Product n2
Customer 1        1       ...       0            0       ...       1
Customer 2        1       ...       1            0       ...       0
...              ...      ...      ...          ...      ...      ...
Customer M        0       ...       1            1       ...       0

Table 9



Steps involved in applying the principles described herein to a sales/marketing
database include:

1. Obtain a sales/marketing database as described above. Where necessary, use methods
known in the art to transform continuous-valued variables into discrete-state
variables.

2. Present this database, in whole or part, such that each customer in the database
corresponds to one or more of the M objects (rows) in the embodiment's data matrix
and so that each product or mailing corresponds to an attribute (column) of the data
matrix. Mailing attributes (if any) plus product attributes together comprise the N
attributes (columns) in the data matrix.

3. Employ the base method above or one of the other embodiments described herein on
the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A rule-generator preprocessor for a rule-based system, or

• A report for marketing personnel, magazine/newspaper circulation directors,
salespeople, managers or other users of the computer database query system,
or a report-generation system, or

• Another computer program that performs some kind of further analysis of the
data, for example, performing more in-depth statistical analysis (e.g., multiple
regression) on the correlated variables, or

• Another computer program that performs some transformation or
optimization on the database.
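
Step 1 above calls for transforming continuous-valued variables into discrete-state variables. One simple way this might be done, shown here only as a sketch, is to bin a value into ranges and emit one 0/1 column per range. The function `bin_indicator` and the age boundaries are assumptions chosen for this example; other discretization methods known in the art would serve equally well.

```python
from typing import List, Sequence


def bin_indicator(value: float, boundaries: Sequence[float]) -> List[int]:
    """One-hot encode `value` into len(boundaries) + 1 bins.

    Bin k holds the values for which exactly k boundaries are <= value,
    so `boundaries` must be sorted in increasing order.
    """
    bin_index = sum(1 for b in boundaries if value >= b)
    return [1 if k == bin_index else 0 for k in range(len(boundaries) + 1)]


# Hypothetical age bands: under 25, 25-40, 41-64, 65 and over.
AGE_BOUNDARIES = (25, 41, 65)

purchases = [1, 0, 1]  # product columns for one customer (toy values)
row = purchases + bin_indicator(32, AGE_BOUNDARIES)  # product columns, then age-group columns
```

The resulting `row` is one object in the data matrix, with the demographic attribute expressed in the same 0/1 form as the product columns.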

The output in this application can be useful in several possible ways.

For example, the output may include correlated k-tuples which comprise sets of 
products that tend to be bought together, either within the same transaction or by the same 
customer across different transactions. Such information can be used to develop "tie-in" and 
co-marketing campaigns, such as, for example, when buyers of NBA basketball tickets are 
given coupons for discounts on NBA team shirts, basketball shoes, and other basketball- 
related merchandise. While it is perhaps not surprising that basketball fans like to wear NBA 
team shirts, the steps described above are capable of discovering other associations between 
products that are not so obvious. 

For another example, the output may include correlated k-tuples which represent 
particular advertising campaigns correlated with particular product purchases. Such 
information can help marketing executives focus their resources on new marketing
campaigns of the type most likely to increase sales.

Use of the Principles Described Herein in Clustering Customer Data

Another rather different application of the principles described herein to marketing
practice is obtained by considering the transpose of the data matrix described above.
Instead of customers as objects (rows) and products and demographic factors as attributes
(columns), consider what is possible when the customers correspond to columns and the
product and demographic variables correspond to rows. (See Table 10). Applying the
principles described herein to this scenario produces correlated k-tuples of customers, or
customer profiles, in the space of demographic and purchasing pattern features. This is seen
to be a form of clustering of the customer data, into groups of customers or customer
profiles that are roughly similar in terms of their buying habits and lifestyles. Such clustering
can be useful in designating special "target groups", to enable more optimal allocation of
marketing resources. Once this transposition of the data is envisioned, the other steps apply
entirely analogously to the descriptions given above for marketing activities.
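
The transposition itself is a mechanical step. A minimal Python sketch, with a hypothetical helper name `transpose` and toy values, is:

```python
from typing import List


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    """Swap rows and columns: an M x N matrix becomes N x M."""
    return [list(column) for column in zip(*matrix)]


# Toy marketing matrix: 2 customers (rows) by 3 product/demographic attributes (columns).
customers_by_attributes = [
    [1, 0, 1],
    [0, 1, 1],
]
# After transposition the customers are the columns (attributes) of the data matrix.
attributes_by_customers = transpose(customers_by_attributes)
```

Running the same coincidence detection on `attributes_by_customers` would then report k-tuples of customers rather than k-tuples of products.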

Use of the method on the "transpose" of the marketing database shown earlier, in
order to cluster the customers is shown in Table 10. It is now the columns that correspond
to a set of customers, while the rows now correspond to products purchased and
demographic features. There are M' rows and N' columns, where perhaps M'=N and
N'=M, for the original M and N described above. The value in table cell[j, i] is one (1) if
customer i purchased product j or possesses demographic feature j and is zero (0)
otherwise.

                Customer 1   Customer 2   ...   Customer N'
Prod/Demo 1         1            0        ...       1
Prod/Demo 2         0            1        ...       0
...                ...          ...       ...      ...
Prod/Demo M'        0            0        ...       1

Table 10

Application of the Principles Described Herein to the Analysis of Medical,
Epidemiological and/or Public Health Databases

Medical scientists and practitioners have long known that many human diseases and
disorders, physical and mental, are caused by complex interactions among many potential
contributing factors. Such factors can include particular genetic conditions or abnormalities,
exposure to biological pathogens, aspects of diet, environment (air, water, noise pollution),
exposure to hazards in the home or workplace, emotional stress, substance abuse and
poverty, among others. The true "causes" of a given condition often remain impossible to
ascertain, though there is much folklore and anecdotal evidence offered in attempts to
explain some instances. The problem of discovery and prevention of health threats is helped
in recent times by the ability of researchers, insurance company representatives,
epidemiologists and public health officials to compile and analyze large amounts of data on
real people, healthy and sick, living and deceased. As in other applications of computers
and statistical analysis to databases, one must contend in this field with a huge number of
variables and the exponential complexity of their potential interactions. This kind of analysis
can be improved greatly by methods that efficiently find correlations and associations
amongst tens, hundreds, or thousands of variables. The principles described herein are
applicable to such a situation.

Application to medical databases can also be represented in terms of the M by N data
matrix we have used in other sections of this document. In one application-specific
embodiment, the rows of the data matrix correspond to particular patients or subjects in a
health study; and the columns correspond to factors thought to contribute to a given disease
or set of diseases. Again, these factors can include socioeconomic factors, lifestyle
(exercise, diet), aspects of the patient's home or workplace environment (e.g., exposure to
carcinogenic chemicals), past medical treatments, and so on. (See Table 11).

The rows in Table 11 correspond to patients or to human subjects in a study, while
the columns correspond to potential disease factors. The value in table cell[i, j] is one (1) if
patient i has experienced or been exposed to factor j and is zero (0) otherwise.

             Factor 1   Factor 2   ...   Factor N
Patient 1       1          0       ...      1
Patient 2       0          1       ...      0
...            ...        ...      ...     ...
Patient M       0          0       ...      1

Table 11



In some application-specific embodiments, there may be not just one disease 
represented implicitly, but, instead, a number of different diseases, represented as attributes 
along with the factors shown in Table 11 and described above. For example, a particular
patient p may have lung cancer but not diabetes or heart disease, and so row p would have a 
1 in the column corresponding to lung cancer and have values of 0 for the columns 
corresponding to diabetes and heart disease. 

Steps involved in applying the current invention to a medical/epidemiological/lifestyle
factors database include:

1. Obtain a database of medical/epidemiological/lifestyle factors as described above.
Where necessary, use methods known in the art to transform continuous-valued
variables into discrete-state variables.

2. Present this database, in whole or part, such that each patient/subject in the database 
corresponds to one or more of the M objects (rows) in the embodiment's data matrix 
and so that each potential disease factor corresponds to an attribute (column) of the 
data matrix. Additional attributes representing different diseases plus the disease 
factors together comprise the N attributes (columns) in the data matrix. 

3. Employ the base method or other embodiments described herein on the data matrix. 

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A rule-generator preprocessor for a rule-based system, or

• A report for doctors, researchers, public health officials, managers or other
users of the computer database query system, or a report-generation system,
or

• Another computer program that performs some kind of further analysis of the
data, for example, performing more in-depth statistical analysis (e.g., multiple
regression) on the correlated variables, or

• Another computer program that performs some transformation or
optimization on the database.

The output of this application can be useful in several possible ways.

For example, the output may include correlated k-tuples which comprise sets of 
factors associated with one or more disease conditions. Such information, perhaps refined 
through further statistical analysis, can provide breakthroughs in understanding, treating, and
preventing those particular diseases.

For another example, the output may include correlated k-tuples which comprise sets 
of factors associated with each other, such associations being previously unknown. The 
discovery of associated lifestyle factors, such as particular diets and obesity or particular 
professions and high levels of alcohol consumption, can itself be useful in improving public 
health policy and medical practice. 

All such discovered correlations can potentially be of great benefit to insurance 
providers, public or private, as they must make their actuarial tables and insurance policies 
reflect accurate predictions of health and life expectancy, for example, based on lifestyle, 
socioeconomic and other factors. 

Use of the Principles Described Herein in Clustering Patient Data 

Another rather different application of the principles described herein to public health
and insurance policy and practice is obtained by considering the transpose of the data matrix
described above. Instead of patients as objects (rows) and potential disease factors as
attributes (columns), consider what is possible when the patients correspond to columns and
the factors correspond to rows. (See Table 12). Use of the current invention in this
scenario produces correlated k-tuples of patients, or patient-profiles, in feature-space. This
is seen to be a form of clustering of the patient data, into groups of patients or patient
profiles that are roughly similar in terms of their lifestyle factors. Such clustering can be
useful in designating special "low-risk" or "high-risk" types of patients or insurance
applicants, to enable more optimal allocation of health services, outreach programs,
insurance protection, or other resources. Once this transposition of the data is envisioned,
the other steps of the preceding application to analysis of medical and other databases apply
entirely analogously to the descriptions given above. (See Table 12).

Use of the principles on the "transpose" of the disease factors database shown
earlier, in order to cluster the patients or policy-holders in factor-space is shown in Table 12.
It is now the columns that correspond to a set of patients, medical study subjects, or
potential insurance policy-holders, while the rows now correspond to potential disease
factors that may include lifestyle factors, socioeconomic factors, workplace factors, and so
on. There are M' rows and N' columns, where perhaps M'=N and N'=M, for the original M
and N described above. The value in table cell[j, i] is one (1) if patient i possesses or has
been exposed to factor j and is zero (0) otherwise.

             Patient 1   Patient 2   ...   Patient N'
Factor 1         1           0       ...       1
Factor 2         0           1       ...       0
...             ...         ...      ...      ...
Factor M'        0           0       ...       1

Table 12



Application of the Principles Described Herein to the Discovery of the Causes
of Failures in Complex Systems

Administrators of complex integrated systems such as computer networks and
factory automation systems have been faced with the difficult diagnosis problems these
systems pose since their inception. Where a series of events in the system (perhaps over a
protracted period of time) leads to a failure of the system as a whole, the diagnosis of the
true cause of the failure can be an almost insurmountable task. For example, a network
interface card on a gateway computer that fails intermittently when under high load
conditions may not cause the host computer to crash but may lead to errors on other
computers that use the card (by proxy) to service their network requests. Such a problem
would be difficult in the extreme to track down using conventional diagnosis techniques.
Tools that can present administrators with a better analysis of the conditions on the system
as a whole that lead to the failure would speed the diagnosis and correction of the
underlying problem.

We need to define the database upon which the principles described herein will be 
applied. 

The database as a whole can be thought of as a state record of a series of
components over time. The columns of this database, when viewed in the data matrix
format used throughout this document, represent the series of components; the rows
represent discrete points in time. The values in the table are intended to be an encoding of
each component's state (on, off, idle, error, and so on) at the time in question. Such logging
procedures are well known to those skilled in the art.
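
One possible encoding of component states into table values is sketched below. The integer codes in `STATE_CODES`, the component names and the snapshot format are all hypothetical choices made for this illustration; any consistent encoding of the state set would do.

```python
from typing import Dict, List

# Hypothetical encoding of the state set (on, off, idle, error).
STATE_CODES: Dict[str, int] = {"off": 0, "on": 1, "idle": 2, "error": 3}


def encode_snapshots(
    snapshots: List[Dict[str, str]], components: List[str]
) -> List[List[int]]:
    """Build the time-by-component matrix: cell[i][j] is the coded state of component j at time i."""
    return [[STATE_CODES[snapshot[name]] for name in components] for snapshot in snapshots]


components = ["gateway_nic", "web_server"]  # hypothetical component names
snapshots = [
    {"gateway_nic": "on", "web_server": "on"},       # time 1
    {"gateway_nic": "error", "web_server": "idle"},  # time 2
]
state_matrix = encode_snapshots(snapshots, components)
```

Each row of `state_matrix` is one point in time and each column one component, matching the layout described above.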

The rows in Table 13 correspond to points in time, while the columns correspond to
individual components in the system. The value in table cell[i, j] is the encoded state of
component j at time i.

          Component 1   Component 2   ...   Component N
Time 1         1             0        ...        1
Time 2         0             1        ...        0
...           ...           ...       ...       ...
Time M         0             0        ...        1

Table 13

Steps involved in applying the method of the current invention to analysis of a system
operations database include:

1. Create a database of system components and their states as described above. The
choice of state sets for the components in the system will be driven by behaviors of 
interest to the administrators of the system as well as by the components themselves. 

2. Present this database, in whole or part, as a data matrix such that each column in the 
data matrix corresponds to a component in the system and each row in the data 
matrix corresponds to a point in time in the series. 

3. Employ the base method above or one of the other embodiments described herein on 
the data matrix. 

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A rule-generator preprocessor for a rule-based system, or

• A report for the administrators of the system, or a report-generation system,
or

• Another computer program that performs some kind of further analysis of the
data, for example, performing more in-depth analysis on the correlated
variables.

The output in this application can be used to indicate the events in the system that
are typically seen to co-occur with a given failure. Given the formulation of the database, 
we need not restrict ourselves to the states of the components in the system at the time of 
the failure - we can expand our examination of the failure conditions to any range of points 
in time for which the database has records. This allows the method to help illuminate subtle 
causal relationships between components that ultimately lead to failure. In the simplest case, 
the output can be used to eliminate some components in the system from scrutiny if it is seen 
that they are not correlated with the failure. 
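
One way the examination might be widened beyond the instant of failure, offered purely as an assumption of this example and not as a technique stated in the description, is to append to each row the component states from the preceding time points, so that a state at time t-1 and a failure at time t fall in the same row. The helper `add_lagged_columns` is hypothetical.

```python
from typing import List


def add_lagged_columns(matrix: List[List[int]], max_lag: int) -> List[List[int]]:
    """Concatenate the states at times t, t-1, ..., t-max_lag into one row per time t.

    The first max_lag rows are dropped because they lack a full history.
    """
    widened = []
    for t in range(max_lag, len(matrix)):
        row: List[int] = []
        for lag in range(max_lag + 1):
            row.extend(matrix[t - lag])
        widened.append(row)
    return widened


# Toy time-by-component matrix with 3 time points and 2 components.
history = [
    [1, 0],
    [0, 1],
    [1, 1],
]
widened = add_lagged_columns(history, max_lag=1)
```

The widened matrix has (max_lag + 1) times as many columns, which matters less than it might for other methods given the linear scaling in the number of columns noted earlier.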

Application of the Principles Described Herein to the Analysis of Complex Systems

Complex systems define a large family of somewhat similar applications. For the purpose of
this discussion, complex systems are defined as systems for which there are no direct
detailed modeling approaches because these systems comprise a huge number of interacting
individual components or parts. Examples would include (but would not be limited to)
economics, individual human behavior, productivity in groups of employees, weather
patterns, crime in a nation, etc. In each of these cases, there are no known methods to
model the system exactly, so variables or sets of variables are used to measure the state of
these systems (examples in the case of economics would be the interest rate, stock market
values and inflation rates). For the purposes of this description, the events in these complex
systems take the form: pre-condition, action and post-condition. These interactions
represent the state of the system before the actions were taken, the actions themselves and
the resulting state of the system at some point after the implementation of the actions. Put
another way, the set of previous perturbations of the system and their outcomes are used as
a history of the system from which to derive information about the system's characteristics.

The kinds of databases of complex systems that can effectively utilize the principles
described herein must meet certain restrictions. There must be some set of variables (either
in common usage or derivable from knowledge in the domain) used to measure the state of
the given system. These variables are used in the pre- and post-condition parts of each
database entry. Additionally, there must be some general set of actions that may be applied
to the system that encompass methods by which it is known the system may be perturbed.
Returning to the economics example, the action set would include all things under the
heading of "fiscal policy".

Formally, the database must include attributes representing zero or more pre-condition
variables, zero or more action variables, and zero or more post-condition variables. Each of
the three types is either present or absent, which gives eight combinations. Leaving aside the
trivial case wherein the database contains no variables of any type, there are seven cases to
consider. They will be presented exhaustively below with examples where appropriate. Note
that in each case, there are two interpretations of relevance. For example, consider the case
where we have pre-condition variables and action variables but no post-conditions. The
correlations can be derived in two ways: the database itself could have had no post-condition
variables in it (and the returned set of correlations is culled to remove any correlations that
involved only variables of one type), or it can be that just the set of correlations themselves
contains no post-condition variables even though the database does in fact contain them. For
the purposes of the discussion, we assume the former is the case - we can always cull the
results of the method on a database that has more types of variables to leave a set of
correlations which do not have some types of variables.
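
The culling described above can be sketched as follows. This is a minimal illustration only: it assumes each returned correlation is a tuple of attribute names and that a lookup table gives the type of each attribute. The function name `cull_single_type`, the type labels and the football attribute names are hypothetical.

```python
# Hypothetical type labels used for illustration: "pre", "action", "post".
def cull_single_type(correlations, var_type):
    """Drop correlations whose attributes all belong to one variable type."""
    kept = []
    for attrs in correlations:
        types_present = {var_type[a] for a in attrs}
        if len(types_present) >= 2:
            kept.append(attrs)
    return kept


var_type = {"third_and_long": "pre", "trailing": "pre", "pass_play": "action"}
correlations = [("third_and_long", "pass_play"), ("third_and_long", "trailing")]
print(cull_single_type(correlations, var_type))
```

Only the first correlation survives here, because the second involves pre-condition variables alone.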

If the database contains only variables of one type (i.e. only action variables or pre or post
condition variables) then the correlations derived from it can be interpreted in one of two
ways. If the variables are pre or post condition variables, then the results indicate situational
archetypes - that is, sets of attribute values (or, equivalently, states of variables) that tend to
be seen together. An example from the domain of weather patterns would be rain and low
barometric pressure. If only action variables are present in the database then correlations
found between them indicate sets of decisions that tend to be made together. In a military
domain, we might discover that flanking maneuvers and offensives tended to be seen
co-occurring. As these types of databases are very similar to others described elsewhere in this
document (as would be the applications of the method in these cases), this section will not
explicitly address them.

The cases where the database contains variables of only two of the three types are three in
number.

Correlations found in a database that contains only pre-condition and action variables
describe the relationship between situations in the domain and the selection of actions. An
example is football play-calling (note that this also involves a complex system that cannot be
modeled in any direct, detailed way - the play-caller). Here the correlations indicate the
tendencies of the action-taking entity, e.g., a coach or quarterback.

If the database contains only action and post-condition variables, then the correlations found
elucidate the effectiveness of sets of actions regardless of pre-conditions. Going back again
to the football example, correlations of this type would illuminate the ability of the team in
question to perform certain actions (e.g., if "third and long yardage to first down" tended to
result in a poor post-condition set, like fourth down, then we would know that the team
tended to be ineffective in this situation). Another important example is drug interaction. In
this case, the actions are the drugs given and the post-conditions are the side-effects
reported for some patient.

While the utility of the case where the database contains only pre and post condition
variables may be unclear on first examination, it may well be that this is one of the most
useful cases. Here we are either interested in things that tend to happen after a situation in
the given domain regardless of actions taken by the decision-maker, or we are in a domain
where there are no actions that can be taken (or none that affect the system itself). An
example of the former would be the fact that the pre-condition "third and long" in football
tends to be followed by the post-condition "fourth and long". In fact, it may be the latter
case that is the most interesting. Consider the case of weather patterns. If we focus on the
post-condition "tornadoes" (that is, we cull the resulting correlation set so that it includes
only those correlations that involve the appearance of "tornadoes" in the post-condition),
then what these correlations tell us are precursor signs that tornadoes are imminent.

The last case is the most general: the database contains all three types of variables. Note
that a database of this form is capable of having correlations of attributes of all the preceding
types. Example domains have already been given (economies, crime in a population, etc.).
Here the correlations can be thought of as rating action sets (given some set of
pre-conditions) based on the quality of the post-conditions.

The last consideration is the types of data that the database entries contain. Binary-valued
attributes, as noted throughout this document, can readily be accepted by this method.
Other value types must have a limited range of discrete values. Where this is not the case (i.e.
real-valued or integer-valued attributes), some transformation must be performed on the
values in question to reduce their range of values to a more manageable number. Various
clustering methods are among the preferred methods for this, and are well-known to those
skilled in the art.
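
As one illustration of such a transformation, the sketch below uses equal-width binning, a simpler stand-in for the clustering methods mentioned above and not a method prescribed by this description. The function name `discretize` and the sample values are hypothetical.

```python
def discretize(values, n_bins):
    """Map real values to integer bin labels 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single bin suffices.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


interest_rates = [0.0, 2.5, 5.0, 7.5, 10.0]
print(discretize(interest_rates, 4))
```

The resulting bin labels can then be treated as the discrete states of the attribute.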

In all cases, the correlations returned by the method are ideal inputs to a case-based
reasoning package. Given a condition of the system (i.e. the current condition), a
case-based reasoning tool could use the associations found by the principles described herein as a
basis for analysis of possible outcomes of selections from the set of actions that can be
applied to the system.

Generally, the principles described herein can be used as a tool to aid decision-makers.
Decision-makers can be "real" or artificial (that is, the method can be used as part of an
artificial intelligence engine whose purpose is to make decisions in the domain of interest).

Description of the Application of the Principles Described Herein to
Databases with Pre-condition Variables and Action Variables:

Given the above-noted restrictions on the form of the database, it is clear that the input
requirements for the application of the embodiments described elsewhere herein are met. In
the convenient data matrix representation cited elsewhere in this document, the M rows in
this context are the total selected set of pre-conditions and actions taken. If the entity that
applies the actions can sensibly be personified then these rows can represent a history of the
decisions made by this entity and the states of the system at the time they were made. The N
columns comprise the set of state variables that define the state of the system and the set of
all applicable action variables that describe the ways in which the system can be perturbed
(see Table 14).

The rows of Table 14 correspond to instances of or combinations of system states (the
pre-condition of the system) followed by actions taken in response to that state, while the
columns correspond to variables thought to describe the state of the system and possible
actions that can be applied to the system. The value in table cell[i, p] is an encoding of the
measure of state variable p in event i if column p is a pre-condition column and is an
encoding of the action taken in event i if column p is an action column.

         Pre 1    ...   Pre j    Act 1      ...   Act k
Row 1    C(1,1)   ...   C(1,j)   A(1,j+1)   ...   A(1,j+k)
Row 2    C(2,1)   ...   C(2,j)   A(2,j+1)   ...   A(2,j+k)
...
Row M    C(M,1)   ...   C(M,j)   A(M,j+1)   ...   A(M,j+k)

Table 14

There are some other considerations that must be addressed prior to the application of the
principles described elsewhere herein to any given domain. The set of state variables must
be defined. This is left to those skilled in the domain itself (e.g., football coaches, military
analysts, etc.).

Previously noted examples are the cases of football play-calling by coaches and military
decisions made by generals. In general, preferred implementations of this invention will use
the method of the current invention on databases of this form in order to extract information
about the action-taking entity. The correlated state variables and actions describe the
tendencies of this entity. As noted above, these may be further analyzed using case-based
reasoning tools to give a better picture of the entity's likely decisions given a state of the
system.

Another use of the invention on databases of this type is in discovering fraud indicators in
tax collection. Here we let the pre-conditions be a set of attributes intended to capture the
salient details of a tax return (such things as total income, total tax owing as reported by the
individual or business, tax exemptions claimed, etc.) and choose the action variables to
define a set of possible tax evasion methods. The correlations found by the invention then
indicate associations between types of tax returns and types of tax evasion. As coincidence
detection bounds the returned correlations statistically, we not only find indicators of
evasion but also the reliability of these findings. Given that tax collection agencies cannot
afford to investigate all tax returns sent to them, this method allows them to find a
well-chosen subset of these returns that is most likely to result in findings of fraud (and greater
monetary returns for the government).
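
The statistical bound referred to above can be sketched as follows. The arithmetic mirrors the `chernoff` routine listed in Appendix A, exp(-2 d^2 / (T r^2)), where d is the difference between the observed and expected counts. The Python function name `chernoff_bound` is hypothetical, the parameter names r and T are simply carried over from that listing, and the sample numbers are invented for illustration.

```python
import math


def chernoff_bound(observed, expected, r, T):
    """Return exp(-2 * (observed - expected)**2 / (T * r**2))."""
    diff = observed - expected
    return math.exp(-2.0 * diff * diff / (T * r * r))


# A larger gap between observed and expected counts gives a smaller bound.
print(chernoff_bound(12, 10, 10, 2))
print(chernoff_bound(30, 10, 10, 2))
```

When the observed and expected counts agree the value is 1, and it shrinks toward 0 as they diverge.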

The last such use that will be presented is in the domain of insurance fraud and is very
similar to the application of the principles described herein to tax collection. The
pre-condition variables are intended to capture a set of details in an insurance claim that are
thought to be possible indicators of fraud (amount claimed, specifics concerning the insured
entity, etc.) and the action variables represent types of fraud. The results found when the
principles described herein are applied show correlations between the details of insurance
claims and types of fraud. Insurance companies cannot investigate all claims sent to them;
so, the application of the principles described herein will narrow the total list of such claims
to a set more likely to be the subject of fruitful investigations.

Steps involved in applying the principles described herein to a database containing
pre-condition and action variables include:

1. Create the database of system states and actions taken by the action-taking entity as
described above. Where necessary, use methods known in the art to transform
continuous-valued attributes into discrete-state attributes.

2. Present this database, in whole or part, such that each state/action set corresponds to
one of the M objects (rows) in a data matrix and so that each state aspect and
action type corresponds to an attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on the
database.
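
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, assuming that each recorded event is a dictionary keyed by variable name and that the values are already discrete. The function name `build_data_matrix` and the football variable names are hypothetical.

```python
def build_data_matrix(events, pre_vars, action_vars):
    """Return (column names, rows) with pre-condition columns first, then actions."""
    columns = list(pre_vars) + list(action_vars)
    rows = [[event[name] for name in columns] for event in events]
    return columns, rows


events = [
    {"down": 3, "yards_to_go": "long", "play": "pass"},
    {"down": 1, "yards_to_go": "long", "play": "run"},
]
columns, rows = build_data_matrix(events, ["down", "yards_to_go"], ["play"])
for row in rows:
    print(dict(zip(columns, row)))
```

Each row of the result is one of the M objects and each column one of the N attributes, in the layout of Table 14.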

This application of the principles described herein provides and utilizes a list of correlated
state/action sets that give insight into the inclinations of the action-taking entity. Were one to
be interested solely in one system state (or in only a few aspects of a given state), for
example the current state, one could cull from the results any correlations that do not share a
given set of aspects with that state. The resultant set would represent correlations between
the aspects of interest and the actions taken in response. The resulting insight into the
action-taking entity's methodology can be used in further decision-making.
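
The culling by state described above can be sketched as follows. The sketch assumes, for illustration only, that each correlation is a tuple of (variable, value) pairs and that the aspects of interest are given as a dictionary. The function name `cull_by_state` and the sample data are hypothetical.

```python
def cull_by_state(correlations, aspects):
    """Keep correlations containing every (variable, value) pair in `aspects`."""
    required = set(aspects.items())
    return [c for c in correlations if required <= set(c)]


correlations = [
    (("down", 3), ("yards_to_go", "long"), ("play", "pass")),
    (("down", 1), ("play", "run")),
]
current_aspects = {"down": 3}
print(cull_by_state(correlations, current_aspects))
```

Here only the correlation that involves the current down survives, pairing that aspect with the action taken in response.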

Description of the Principles Described Herein as Applied to Databases with
Pre-condition Variables and Post-condition Variables:

Here, too, the above-noted restrictions on the form of the database force compliance with
the input requirements of the embodiments described elsewhere herein. The M rows in this
context are the instances or combinations of pre-conditions and post-conditions (viewed
together, one can think of these rows as being the system's transitions between states). The
N columns comprise the set of state variables that define the state of the system
before and after the transition (see Table 15).

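The value in cell[i, j] of Table 15 is an encoding of the measure of state variable j either
before or after the transition.

         Pre 1    ...   Pre j    Post 1     ...   Post k
Row 1    C(1,1)   ...   C(1,j)   C(1,j+1)   ...   C(1,j+k)
Row 2    C(2,1)   ...   C(2,j)   C(2,j+1)   ...   C(2,j+k)
...
Row M    C(M,1)   ...   C(M,j)   C(M,j+1)   ...   C(M,j+k)

Table 15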

There are some other considerations that must be addressed prior to the application of this
invention in any given domain. The set of state variables must be defined. This is left to
those skilled in the domain itself.

Equally important is the selection of time quanta that define the granularity of the transitions.
This too is left to those skilled in the art to decide based on their own expertise and the kinds
of information they wish to extract. It is assumed that some minimum granularity is imposed
by either the complexity of gathering such data or by the limits of the usefulness of such
data. Given this, one can then pick any multiple of this minimum granularity to be the time
between pre and post conditions. At the very least, this distance in time should be long
enough for the system to have changed its state.
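
The choice of time quantum can be sketched as follows. This is a minimal illustration, assuming the system's history is a list of state tuples recorded at the minimum granularity and that the quantum is expressed as a whole number of such observations. The function name `make_transitions` and the sample indicators are hypothetical.

```python
def make_transitions(states, lag):
    """Pair each observed state with the state `lag` observations later."""
    rows = []
    for t in range(len(states) - lag):
        # Pre-condition columns first, then post-condition columns.
        rows.append(list(states[t]) + list(states[t + lag]))
    return rows


# Four daily observations of two hypothetical binary indicators.
daily = [(0, 1), (1, 1), (1, 0), (0, 0)]
print(make_transitions(daily, 1))  # one-day quantum: three transition rows
print(make_transitions(daily, 2))  # two-day quantum: two transition rows
```

A longer quantum yields fewer rows from the same history, which is one practical consideration when selecting it.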

Possible domains of application for this invention include economics and fiscal policy, stock
market prediction, athletic talent scouting and weather prediction. Presented below are brief
descriptions of each in turn to show how these problems may be organized to fit the
specifications of the method of the current invention.

In the domain of economics and fiscal policy, we propose a database of sets of states where
the states are a set of economic indicators (inflation and interest rates, housing starts, GDP
and so on). Each row in the database should contain two such states (the pre and post
condition of the system) separated by a fixed amount of time. The correlations found by
the method of the current invention then give insight into cycles in the economy.

For stock market prediction, we propose a set of stocks (presumably large) which are
thought to have influence over one another. Again, a fixed period of time is selected for
transitions. The rows of this database then describe the transitions of these stocks over the
chosen period of time. The output of the invention then indicates which sets of stocks "move"
in a correlated manner over that period of time.

Athletic talent scouting (e.g., by professional teams prior to a draft of young players) would
involve an examination of the history of such selections. Each row of the data matrix would
then pertain to an individual player. The pre-condition state is a selection of statistics (and
any other information available about the player) thought to be indicative of future
performance at the professional level. The post-condition state would then be some set of
variables intended to measure that player's success at the professional level. The
correlations discovered by the invention would help teams find the best set of indicators of
future success with which to make their selections. Note that in this case, the pre and post
conditions need not be of exactly the same form. There is no intended restriction on state
representations to force them to be equivalent.

Weather prediction is a very straightforward application of this invention. Here the
granularity of the selected time quantum is based solely on the kind of information the user
wishes to discover. Put another way, the time quantum determines the degree of prediction
desired. If we choose a single day, then the correlations found by the method will help us
predict the weather (given a set of values for each of the pre-condition variables that
describes the current weather) a day in advance. If a week (or a month etc.) is the chosen
quantum, then this is how far into the future the predictions will extend.

In general, preferred embodiments of this invention will use the method of the current
invention on databases of this form in order to extract information about how the current
state of the system acts as a predictor for a future state. Given probabilistically bounded
data correlations between states of the system, effective predictions can be made about the
system's behavior.

Steps involved in applying the current invention to a database containing pre-condition and
post-condition variables include:

1. Create the database of transitions between system states, wherein a system state is
represented by a value of a state variable, over the chosen time quantum as described above.
Where necessary, use methods known in the art to transform any continuous-valued state
variables into discrete-state variables.

2. Present this database, in whole or part, such that each state-to-state transition set
corresponds to one of the M objects (rows) in the embodiment's data matrix and so that
each state variable corresponds to an attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on
the database.

Description of the Application of the Principles Described Herein to Databases 
with Action Variables and Post-condition Variables: 

Here, too, the above-noted restrictions on the form of the database force compliance with
the input requirements of the embodiments described elsewhere herein. The M rows in this
context are the total selected set of actions and post-conditions. The N columns are
comprised of the set of state variables that define the state of the system before and after the
transition (see Table 16).

The rows of Table 16 correspond to observed instances of, or hypothetical combinations of,
actions applied to the system and their resulting system states. The columns correspond
either to possible actions that can be applied to the system or to individual state representation
variables. If column p corresponds to one of the action types in the database, the value in
table cell[i, p] of Table 16 is an encoding of the action taken. If column j is a column used
to indicate some aspect of a state of the system, then the value in table cell[i, j] is an
encoding of the measure of that aspect.

Table 16

As noted in previous examples, decisions that must be made prior to the application of the
method of the current invention to databases of this type include the choice of state variables
used to store the state of the system at a given point in time and the choice of time quantum
used to temporally separate the actions from the post-conditions. These choices are left to
those skilled in the domain of application. The time quantum chosen must, in the most
trivial case, be long enough for the actions to have had some effect on the state of the
system.

Possible uses of this invention include such widely varying fields as player management in
hockey and the study of drug interaction.

For the purposes of this document, player management in hockey concerns only the selection
of players for the next shift on the ice given knowledge of the history of these players. The
action variables in this case are binary values indicating whether or not a player is selected
for the shift while the post-condition variables comprise a set of outcomes within the domain
of hockey (such things as the relative score in that shift, penalties called, the length of any
penalties, relative number of shots taken, etc.). By the formulation of the problem, it is clear
that the discoveries produced by the invention indicate correlations between sets of players
chosen and outcomes on the next shift. In situations where the opposing players are known
a priori, these players can be added to the action variables. In this case, we will find
correlations between sets of players, both for our team and against it, and outcomes. Given
this knowledge the invention is useful as an aid to coaches in selecting players most likely to
produce beneficial results.

The study of drug interaction is a natural fit for this invention. Here we let the action
variables be binary values indicating whether or not a given patient has been administered
some drug or combination of drugs. The post-condition variables indicate the list of side
effects reported by the patient. The results found by the invention then indicate statistically
bounded correlations between sets of drugs given to patients and side effects. In this
fashion, the method of the current invention can be used to determine contra-indications in
the use of drugs but is perhaps best suited as a way to select sets of interactions upon which
to focus further study.
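
The comparison that underlies such a correlation can be sketched as follows. This is a simplified illustration, not the sampled procedure of the base method: it counts co-occurrences over the whole matrix and compares them with the count expected if the columns were independent. The function name `observed_and_expected`, the column meanings and the patient data are hypothetical.

```python
def observed_and_expected(rows, cols):
    """Compare the observed co-occurrence count with the independence expectation."""
    m = len(rows)
    # Observed: rows in which every selected binary column is 1.
    observed = sum(1 for row in rows if all(row[c] == 1 for c in cols))
    # Expected under independence: M times the product of marginal frequencies.
    expected = float(m)
    for c in cols:
        expected *= sum(row[c] for row in rows) / m
    return observed, expected


# Columns: drug A given, drug B given, rash reported (one row per patient).
patients = [[1, 1, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0]]
print(observed_and_expected(patients, [0, 1, 2]))
```

An observed count well above the expected count marks the drug pair and side effect as a candidate for further study.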

Steps involved in applying the current invention to a database containing action and
post-condition variables include:

1. Create the database of transitions between system states and actions over the chosen
time quantum as described above, wherein a system state is represented by a value of a state
variable and an action is represented by a value of an action type. Where necessary, use
methods known in the art to transform continuous-valued state variables and action types
into discrete state variables and action types.

2. Present this database, in whole or part, to an embodiment of the current invention such
that each action set/state set pair corresponds to one of the M objects (rows) in the
embodiment's data matrix and so that each state variable or action type corresponds to an
attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on
the database.

Description of the Application of the Principles Described Herein to Databases with
Pre-condition Variables, Action Variables and Post-condition Variables:

Here, too, the above-noted restrictions on the form of the database force compliance with
the input requirements of the embodiments described elsewhere herein. The M rows in this
application are the total selected set of pre-conditions, actions and post-conditions. The N
columns are comprised of the set of state variables that define the state of the system before
and after the transition as well as the encoded action types (see Table 17).

The rows of Table 17 correspond to instances or combinations of pre-conditions, actions
taken and the resulting post-conditions. The columns correspond to types of actions
possible in the domain as well as aspects of interest to any given situation in the domain (for
both pre and post condition columns). If column p corresponds to one of the action types in
the database, the value in cell[i, p] of Table 17 is an encoding of the action taken. If column
p is a column used to specify some aspect of either the pre-condition or the post-condition,
then the value in table cell[i, p] is an encoding of the measure of that aspect.

         Pre 1    ...   Pre i    Act 1      ...   Act j      Post 1       ...   Post n
Row 1    C(1,1)   ...   C(1,i)   A(1,i+1)   ...   A(1,i+j)   C(1,i+j+1)   ...   C(1,i+j+n)
Row 2    C(2,1)   ...   C(2,i)   A(2,i+1)   ...   A(2,i+j)   C(2,i+j+1)   ...   C(2,i+j+n)
...
Row M    C(M,1)   ...   C(M,i)   A(M,i+1)   ...   A(M,i+j)   C(M,i+j+1)   ...   C(M,i+j+n)

Table 17

As noted in previous examples, decisions that must be made prior to the application of the
method of the current invention to databases of this type include the choice of state variables
used to store the state of the system at a given point in time and the choice of time quantum
used to temporally separate the actions from the post-conditions. In this case, it should be
noted that it is not necessary for the pre and post conditions to be equivalent (with respect to
the choices of variables). These choices are left to those skilled in the domain of application.
The time quantum chosen must, for example, be long enough for the actions to have had
some effect on the state of the system.

Possible uses of this invention include economic policy, crime-fighting and military
strategizing.

Given some set of variables to define the state of an economy (interest rates, inflation, GNP
and so on) and a set of actions taken as part of the governing body's economic policy
(issuing and buying back government bonds, etc.), we create a database of economic events
of the form: existing economic state, fiscal policy measures taken and economic state
following the policy decisions. The correlations found by the method of the current
invention give a measure of the effectiveness of economic policy decisions, given a state of
the economy. Such knowledge would be beneficial in deciding economic policy as it would
show historical support (or the lack thereof) for a given set of decisions.

In a similar vein, the use of the current invention to aid in setting anti-crime policy starts
with the creation of a database of previous states of the community's crime, policy measures
taken and the resulting state of crime in the community. The state variables could include
things like the rates for differing types of crime (breaking and entering, auto theft, etc.),
differing characteristics of crime (i.e. whether or not handguns were used etc.) and so on.
The action variables in this case could include such things as minimum sentencing guidelines
for various crimes, "three-strike" laws, the adoption of the death penalty, as well as
education and mental health funding. On such a database, the invention would find
correlations involving existing crime states, policy decisions and the outcomes of those
decisions. It is proposed that these correlations could prove an invaluable aid to those
charged with making such decisions.

The concept of the "decision-maker" needs careful consideration in the domain of military
strategy. It may well be the case that there is not enough of a "track record" to fill a
database with enough of a history of any one general's decision making. In such a case,
preferred implementations can extend the concept of the decision-maker to include all similar
decision-makers. As an example, consider a single general commanding a tank division. If
the general were recently promoted, one would be wise to consider all the history of all such
generals of the same allegiance. To increase further the granularity of the use of the method,
the database could be filled with the decisions made by all infantry lieutenants rather than
with those of any one lieutenant. Correlations found would be indicative of the tendencies
of that class of generals given some measure of the battlefield conditions faced when they
made their decisions. Equally, one would be in a position to determine which battlefield
situations they handled poorly because one has access to the outcomes of the decision sets.
Such knowledge could prove vital to selecting an opposing strategy.

Steps involved in an application of the principles described herein to a database containing
pre-condition, action and post-condition variables include:

1. Create the database of states and actions covering the chosen time quantum as described
above. Where necessary, use methods known in the art to transform continuous-valued
state variables and action types into discrete state variables and action types.

2. Present this database, in whole or part, such that each state/action/state triple
corresponds to one of M objects (rows) in a data matrix and so that each state variable or
action type corresponds to an attribute (column) of the data matrix.

3. Employ the base method or other embodiment described herein on the data matrix.

4. Direct the discovered correlated k-tuples of attributes to:

• A graphical viewer or printer, or

• A report for decision-makers, or a report-generation system, or

• Another computer program that will use the correlations found as a basis for making
decisions (for example, a case-based reasoning package), or

• Another computer program that performs some transformation or optimization on
the database.

It will be understood by those skilled in the art that this description is made with
reference to the preferred embodiment and that it is possible to make other embodiments
employing the principles of the invention which fall within its spirit and scope as defined by
the claims on the pages following Appendices A through E attached hereto, which
Appendices form a part of this description.

APPENDIX A

File coinc.pl: 1/15

# perl version of Evan Steeg's Coincidence Detection Algorithm,
# here applied to data which comes in rows and columns of ascii
# symbols. Used first for tests on artificial and real (HIV)
# protein sequence data.
# march 1996

########################################################################

$tiny_num = 0.000001;

# table of precomputed factorials, 0! through 11!
$fact[0] = 1;
$fact[1] = 1;
$fact[2] = 2;
$fact[3] = 6;
$fact[4] = 24;
$fact[5] = 120;
$fact[6] = 720;
$fact[7] = 5040;
$fact[8] = 40320;
$fact[9] = 362880;
$fact[10] = 3628800;
$fact[11] = 39916800;

sub compare
{
    if ($a < $b)
    {
        $r = -1;
    }
    elsif ($a == $b)
    {
        $r = 0;
    }
    else
    {
        $r = 1;
    }
    # print "a: $a, b: $b, r: $r\n";
    return $r;
}

sub comp_aa
{
    # my ($a1, $c1, $a2, $c2, $r);
    my ($c1, $c2);

    # $a1 = substr $a, 0, 1;
    $c1 = substr $a, 1;
    # $a2 = substr $b, 0, 1;
    $c2 = substr $b, 1;

    if ($c1 < $c2)
    {
        $r = -1;
    }
    elsif ($c1 == $c2)
    {
        $r = 0;
    }
    else
    {
        $r = 1;
    }
    return $r;
}

# calc the factorial of a number. want (n)
# for now, it's just easier and faster to hard code them into a table
sub factorial
{
    my ($n) = @_;
    # print "n: $n\n";
    if ($n >= 0 && $n <= 11)
    {
        return $fact[$n];
    }
    else
    {
        print "ERROR: n larger than max defined factorial requested. ($n)\n";
        exit(0);
    }
}

# calc the binomial coeff. want r (number of iterations) and h (
# observed number of hits)
sub binomial_coeff
{
    my ($r, $h) = @_;
    # print "r: $r, h: $h\n";
    $rf = &factorial($r);
    $hf = &factorial($h);
    $rhf = &factorial(($r - $h));
    # print "rf: $rf, hf: $hf, rhf: $rhf\n";
    return ($rf / ($hf * $rhf));
}

# calc the chernoff. want ($observed, $expected, $r1, $T1)
sub chernoff
{
    my ($observed, $expected, $r1, $T1) = @_;
    $diff = $observed - $expected;
    $diff_sq = $diff * $diff;
    $numerator = 2.0 * (0.0 - $diff_sq);
    $denominator = $T1 * ($r1 * $r1);
    return (exp($numerator / $denominator));
}

# calc the ith power of a number. NOTE: this thing can only grok
# positive integer exponents larger than 0!
sub pow
{
    my ($i, $p) = @_;
    if ($p < 0 || $p != int($p))
    {
        print "ERROR: I can only grok positive integer exponents larger than 0!\n";
        exit(0);
    }
    $a = 1.0;
    for ($n = 0; $n < $p; $n++)
    {
        $a *= $i;
    }
    # print "i: $i, p: $p, a: $a\n";
    return $a;
}



# want ($r, $h, $c_element), cset and aasites assumed as global
sub prob_coincidence
{
    my ($r, $h, $c_element) = @_;
    my @elements;

    if ($r > 0)
    {
        $joint = 1.0;
        $joint_neg = 1.0;

        @aalist = split /\|/, $c_element;
#print "c_element: $c_element, aalist: @aalist\n";

        foreach $aa (@aalist)
        {
            $joint *= $aasites{$aa};
            $joint_neg *= (1.0 - $aasites{$aa});
#print "aa: $aa, joint: $joint, joint_neg: $joint_neg\n"
        }

#       $ans = &binomial_coeff($r, $h) * &pow($joint, $h) *
#              &pow($joint_neg, ($r - $h));

        $ans = &binomial_coeff($r, $h) * ($joint ** $h) *
               ($joint_neg ** ($r - $h));
    }
    else
    {
        return (0.0);
    }

#   print "joint: $joint, joint_neg: $joint_neg, ans: $ans\n";
    return $ans;
}

# expected number of hits for a coincidence in one sample of $r rows
sub expected_size
{
    my ($r, $c_element) = @_;

    $sum = 0.0;

    foreach $h (1..$r)
    {
        $sum += (&prob_coincidence($r, $h, $c_element) * $h);
#print "r: $r, h: $h, sum: $sum\n";
    }

    return $sum;
}

sub prob_of_correlation
{
    my ($c_element, $h_total_obs, $h_expected_total, $r, $T) = @_;

#   $h_expected_total = &expected_size($r, $c_element);

    $ch = &chernoff($h_total_obs, ($h_expected_total * $T), $r, $T);

    return $ch;
}

# randomly select a list of 'sample_size' unique sequences
# in the range from 0 to the number of rows in @family
# want sample_size, family
sub rsample_family
{
    my $R = shift @_;
    my @family = @_;
    my (%which_rows, @sampled_family, @sampled_rows);

#   print "whichrows: ", keys %which_rows, "\n";

    # generate $R number of unique keys
    $f = scalar @family;

    while (scalar(keys %which_rows) < $R)
    {
        $n = int(rand $f);
#print "randnum: $n\n";
        $which_rows{$n} = 1;
    }

#   print "whichrows: ", keys %which_rows, "\n";

    # pick out the corresponding sequence from the 'family list'
    @sampled_rows = keys %which_rows;

    foreach $line (@sampled_rows)
    {
        push @sampled_family, $family[$line];
    }

#print "RSAMPLE\n";
#   $i = 0;
#   foreach $line (@sampled_family)
#   {
#       print $line, " : ";
#       $n = $sampled_rows[$i];
#       print $n, ":", $family[$n], "\n";
#   }
#print "RSAMPLE END\n";
#exit(0);

    return @sampled_family;
}



# return the n'th column of an array
# want ($n, @array)
sub column
{
    my $n = shift @_;
    my @a = @_;
    my $col;

#print "COLUMN: $n\n";
#foreach (@a)
#{
#   print "$_\n";
#}
#   $i++;
#   print "$line\n";

    # go thru and append the n'th element of each row in @array to $col
    $col = "";
    foreach $line (@a)
    {
        $col = $col . substr($line, $n, 1);
    }

#   print length $col, $col, "\n";
#print "COLUMN END\n";

    return $col;
}

# find all occurrences of a character 'aa' in the n'th column of the
# array sampled_family
# want ($aa, $n, @sampled_family)
sub find_all
{
    my $aa = shift @_;
    my $n = shift @_;
    my @sampled_family = @_;
    my ($bstring, $col);

#   print "FIND_ALL: $aa, $n\n";
#   print "012345678901234567890\n";
#   foreach (@sampled_family)
#   {
#       print "$_\n";
#   }
#   print "JUMPING TO COL\n";

    $col = &column($n, @sampled_family);

#   print "GOT: $col\n";

    $bstring = "";

    if ((index $col, $aa) != -1)    # make sure $aa is found in $col
    {
        for ($i = 0; $i < length($col); $i++)
        {
            $c = substr $col, $i, 1;

            if ($c eq $aa)
            {
                $bstring = $bstring . "1";
            }
            else
            {
                $bstring = $bstring . "0";
            }
        }
    }
    else
    {
        $bstring = "NOT_FOUND";
    }

#   print "$bstring\n";
#   print "FIND_ALL END\n";
#   exit(0);

    return $bstring;
}

# this subroutine isn't exactly the most optimal code, but....
# mutual information between two columns of the global @family
sub mi
{
    my ($col1, $col2, $m) = @_;
    my ($s1, $s2, $row, $p1, $p2, $pj, $a1, $a2, $s, %sj, $contrib, $total);

    $s1 = column($col1, @family);
    $s2 = column($col2, @family);

#   print "col1: $col1, $s1\n";
#   print "col2: $col2, $s2\n";
#   print "keys1: ", keys %sj, "\n";

    # calc the joint prob
    for $row (0..($m-1))
    {
        $a1 = substr $s1, $row, 1;
        $a2 = substr $s2, $row, 1;
        $s = $a1 . $a2;

        if (exists $sj{$s})
        {
            $sj{$s}++;
        }
        else
        {
            $sj{$s} = 1;
        }
#       print "a1: $a1, a2: $a2, s: $s\n";
    }

#   print "keys2: ", keys %sj, "\n";

    foreach $s (keys %sj)
    {
        $sj{$s} = $sj{$s} / $m;
        if ($sj{$s} < $tiny_num)
        {
            $sj{$s} = $tiny_num;
        }
#       print "$s: $sj{$s}\n";
    }

    $total = 0;

    foreach $s (keys %sj)
    {
        $a1 = substr $s, 0, 1;
        $a2 = substr $s, 1, 1;

        # find partial probs
        $a1 = $a1 . $col1;
        $a2 = $a2 . $col2;

        $pj = $sj{$s};
        $p1 = $aasites{$a1};
        $p2 = $aasites{$a2};

        if ($p1 < $tiny_num)
        {
            $p1 = $tiny_num;
        }
        if ($p2 < $tiny_num)
        {
            $p2 = $tiny_num;
        }
        if ($pj < $tiny_num)
        {
            $pj = $tiny_num;
        }

        $contrib = ($pj * log($pj / ($p1 * $p2)));
        $total += $contrib;

#       print "a1: $a1, a2: $a2, s: $s, pj: $pj, p1: $p1, p2: $p2, contrib: $contrib, total: $total\n";
    }

    return $total;
}



# build the 0/1 incidence vector of symbol $key in the column string $col
sub incidence_vec
{
    my ($col, $key) = @_;
    my ($vec);
    $vec = "";

    if ((index $col, $key) != -1)
    {
        for $i (0..((length $col) - 1))
        {
            $c = substr $col, $i, 1;

            if ($c eq $key)
            {
                $vec = $vec . "1";
            }
            else
            {
                $vec = $vec . "0";
            }
        }
    }
    else
    {
        $vec = "NOT_FOUND";
    }

    return $vec;
}



# given two columns, go through each letter in the alphabet and
# generate the incidence vector for them. then if the results are
# non-zero, send them to mi2_real for the real computations
sub mi2
{
    my ($col1, $col2, $m) = @_;
    my ($s1, $s2, $key1, $key2, $total, $sum);

    $s1 = column($col1, @family);
    $s2 = column($col2, @family);

    $sum = 0.0;

    foreach $key1 (keys %alphabet)
    {
        $vec1 = incidence_vec($s1, $key1);

        if ($vec1 ne "NOT_FOUND")
        {
            foreach $key2 (keys %alphabet)
            {
                $vec2 = incidence_vec($s2, $key2);

                if ($vec2 ne "NOT_FOUND")
                {
                    $total = mi2_real($vec1, $vec2, $m);

#                   print "s1: $s1\n";
#                   print "vec1: $vec1\n";
#                   print "s2: $s2\n";
#                   print "vec2: $vec2\n"

                    if ($total > 1.0)
                    {
                        # the argument list of this printf is cut off in the
                        # listing; ($col1, $col2, $total) is an assumption
                        printf "mi2, cols: %d, %d | key1: $key1 | key2: $key2 | total: %.5f\n",
                               $col1, $col2, $total;
                        $sum += $total;
                    }
                }
            }
        }
    }

    print "total sum: $sum\n";
}



# Given two columns (the actual string of amino acid symbols),
# produce all combinations (pairs) of attr1, attr2, where attr1 is
# an incidence vector for a symbol occurring in col1 and
# likewise for attr2 from col2. Then call mi2 on the pair
# of incidence vectors.
# Compute mutual_info(attr1, attr2) where attri are binary incidence
# vectors for two a1@col1, a2@col2.

sub mi2_real {
    my ($attr1, $attr2, $m) = @_;
    my ($a, $a1, $a2, $s, $p0, $p1, $pj, %hash_single1, %hash_single2,
        $total, %hash_joint);

    for $row (0..($m-1))
    {
        $a1 = substr $attr1, $row, 1;
        $a2 = substr $attr2, $row, 1;
        $s = $a1 . $a2;

#print "row: $row, a1: $a1, a2: $a2, s: $s\n";

        if (exists $hash_single1{$a1})
        {
            $hash_single1{$a1}++;
        }
        else
        {
            $hash_single1{$a1} = 1;
        }

        if (exists $hash_single2{$a2})
        {
            $hash_single2{$a2}++;
        }
        else
        {
            $hash_single2{$a2} = 1;
        }

        if (exists $hash_joint{$s})
        {
            $hash_joint{$s}++;
        }
        else
        {
            $hash_joint{$s} = 1;
        }
    }

    foreach $s (keys %hash_joint)
    {
        $hash_joint{$s} = $hash_joint{$s} / $m;

        if ($hash_joint{$s} < $tiny_num)
        {
            $hash_joint{$s} = $tiny_num;
        }
#print "s: $s, hj: $hash_joint{$s}\n";
    }

    foreach $a (keys %hash_single1)
    {
        $hash_single1{$a} = $hash_single1{$a} / $m;

        if ($hash_single1{$a} < $tiny_num)
        {
            $hash_single1{$a} = $tiny_num;
        }
#print "a: $a, hs1: $hash_single1{$a}\n";
    }

    foreach $a (keys %hash_single2)
    {
        $hash_single2{$a} = $hash_single2{$a} / $m;

        if ($hash_single2{$a} < $tiny_num)
        {
            $hash_single2{$a} = $tiny_num;
        }
#print "a: $a, hs2: $hash_single2{$a}\n";
    }

    foreach $s (keys %hash_joint)
    {
        $a1 = substr $s, 0, 1;
        $a2 = substr $s, 1, 1;
        $pj = $hash_joint{$s};
        $p1 = $hash_single1{$a1};
        $p2 = $hash_single2{$a2};

        if ($p1 < $tiny_num)
        {
            $p1 = $tiny_num;
        }
        if ($p2 < $tiny_num)
        {
            $p2 = $tiny_num;
        }
        if ($pj < $tiny_num)
        {
            $pj = $tiny_num;
        }

        $total += ($pj * log($pj / ($p1 * $p2)));
    }

    return $total;
}



######################################################################

# check to make sure a file name was given
if (scalar(@ARGV) != 4)
{
    print "usage: $0 data_file sample_size iterations min_freq\n";
    exit;
}

$filename = $ARGV[0];
$sample_size = $ARGV[1];
$iterations = $ARGV[2];
$min_freq = $ARGV[3];

# read contents of file into array family
open(DATAFILE, $filename) || die "cannot open $filename: $!\n";
@family = <DATAFILE>;
chop @family;

# remove nial's +, and | delimiters
#@family = grep (!/\+/, @family);   # get rid of lines beginning with '+'

#foreach (@family)                  # remove all |'s
#{
#   tr/\|//d;
#}

#@family = grep (/^\w/, @family);

#foreach (@family)
#{
#   print "$_\n";
#}

#while (length $family[(scalar @family) - 1] < 1)
#{
#   print "Empty line: ", scalar @family, " deleted.\n";
#   pop @family;
#}

#$i = 0;
#foreach (@family)
#{
#   print "$i: $_\n";
#   $i++;
#}



# NOW for the real stuff!

print "Sample_size: $sample_size\n";
print "Iterations : $iterations\n";
print "Min_freq   : $min_freq\n";

# construct aasite list
$n = length $family[0];
$m = scalar @family;

foreach $row (@family)
{
    for $j (0..($n - 1))
    {
        $c = substr $row, $j, 1;
        if (length($c) != 1)
        {
            print "BUG!!! $row, $j\n";
            exit;
        }

#print "$line:$j:$c\n";
        $i = $j;    # + 1;
        $s = $c . $i;    # create aasite name

#       print "c: $c, j: $j, i: $i, s: $s\n";

        if (exists $aasites{$s})
        {
            $aasites{$s}++;
        }
        else
        {
            $aasites{$s} = 1;
        }
    }
}

# figure out the alphabet
#@a = keys %aasites;
#print @a, "\n";
#foreach (@a)
#{
#   print "$_:$aasites{$_}\n";
#}

foreach $entry (keys %aasites)
{
    $c = substr $entry, 0, 1;    # want the first character in each entry
#   print $c, "\n";
    $alphabet{$c} = 1;
}

print keys %alphabet, "\n";

# calc marginal probabilities for each column of aasites
foreach $key (keys %aasites)
{
    $p = $aasites{$key} / $m;
    $aasites{$key} = $p;
#   print "$key: $p\n";
}

for $col1 (0..($n-2))
{
    for $col2 (($col1 + 1)..($n-1))
    {
        $mi = &mi($col1, $col2, $m);

        print "columns: ", ($col1 + 1), ", ", ($col2 + 1), " mi = $mi\n";

        $mi2 = mi2($col1, $col2, $m);    # might as well do mi2 while we're here
    }
}

#exit;

##### MAIN LOOP

# seed the random number generator
#$seed = 111;
#srand($seed);    # remove '($seed)' to get seed from the system clock
srand();

# print "START MAIN LOOP\n";

for ($iter = 0; $iter < $iterations; $iter++)
{
    my %BINS;

#   print "\nITERATION: $iter\n";
    print STDERR "ITERATION: $iter\n";

#   print "JUMP TO rsample_family\n";

    @sampled_family = &rsample_family($sample_size, @family);

#   print "sample size: $sample_size\n";
#   print "012345678901234567890\n";
#   $i = 0;
#   foreach (@sampled_family)
#   {
#       print "$i: $_\n";
#       $i++;
#   }
#   print "rsample printed\n";

    # bin every aasite by its occurrence string within the sample
    foreach $aasite (keys %aasites)
    {
        $aa = substr $aasite, 0, 1;
        $col_num = substr $aasite, 1;

#       print "aa: $aa, colnum: $col_num\n";

        $occurence_string = &find_all($aa, $col_num, @sampled_family);

#       print $occurence_string, "\n";

        if ($occurence_string ne "NOT_FOUND")
        {
#           print "FOUND occ_str: $occurence_string\n";
            if (exists $BINS{$occurence_string})
            {
                $BINS{$occurence_string} = $BINS{$occurence_string} .
                                           $aasite . "|";
            }
            else
            {
                $BINS{$occurence_string} = $aasite . "|";
            }
        }
    }

#   foreach (keys %BINS)
#   {
#       print "$_:$BINS{$_}\n";
#   }

    # sort the collision list associated with each BIN and throw away
    # entries with just one 'collision'
    foreach $bin (keys %BINS)
    {
        my @aalist;

        $s = $BINS{$bin};
#       print $s, "\n";

        @aalist = split /\|/, $s;

#       $i = 0;
#       foreach (@aalist)
#       {
#           print "$i: $_\n";
#           $i++;
#       }

        if ((scalar @aalist) > 1)    # throw away single 'collisions'
        {
            # then sort the others
#           $sorted_aalist = join "|", sort comp_aa @aalist;
            $sorted_aalist = join "|", sort @aalist;

#           print "sorted aalist: $sorted_aalist\n";

            $BINS{$bin} = $sorted_aalist;
        }
        else
        {
#           print "chucked\n";
            delete $BINS{$bin};
        }
    }

#   print "SORTED BINS\n";
#   $z = 0;
#   foreach (keys %BINS)
#   {
#       print "$z:$_:$BINS{$_}\n";
#       $z++;
#   }

    # now we update the cset table
    foreach $bin (keys %BINS)
    {
        $count = 0;

        # sum up bin hits; sample_size should equal length of bins
        for ($i = 0; $i < $sample_size; $i++)
        {
            $c = substr $bin, $i, 1;
            if ($c eq "1")
            {
                $count++;
            }
        }

        $key = $BINS{$bin};
#       print "cset key: $key\n";

        if (exists $cset{$key})
        {
            $cset{$key} += $count;
        }
        else
        {
            $cset{$key} = $count;
        }
    }

#   print "CSET\n";
#   $z = 0;
#   foreach (keys %cset)
#   {
#       print "$z:$_:$cset{$_}\n";
#       $z++;
#   }

#   print "$iter, BINS: ", scalar keys %BINS,
#   print "CSETS: ", scalar keys %cset, "\n";
    print STDERR "BINS : ", scalar(keys %BINS), "\n";
    print STDERR "CSETS: ", scalar(keys %cset), "\n";
}



print "CSETS: ", scalar(keys %cset), "\n";

print "\n\nGathering stats.\n";

foreach $entry (keys %cset)
{
    $h_total_obs = $cset{$entry};
    $h_expected_total = &expected_size($sample_size, $entry);
    $correlation = &prob_of_correlation($entry, $h_total_obs,
                                        $h_expected_total,
                                        $sample_size,
                                        $iterations);

    if ($correlation < 0.000000001)
    {
        $correlation = 0.0;
    }

    if ($h_total_obs >= $min_freq)
    {
        # this is a really ugly hack to prevent hash key collisions
        # (the appended string is lost in the listing; "*" is inferred
        # from the commented-out index call in the output section)
        $h = $h_total_obs;
        while (exists $output{$h})
        {
            $h = $h . "*";
        }

#       print "\nEntry: $entry\n";
#       print "Obsrv hits: $h_total_obs\n";
#       printf "Expct hits: %.9f\n", $h_expected_total * $iterations;
#       printf "Prob corrl: %.9f\n", $correlation;

        $output{$h}[0] = $entry;
        $output{$h}[1] = $h_total_obs;
        $output{$h}[2] = $h_expected_total * $iterations;
        $output{$h}[3] = $correlation;
    }
}



@hits = keys %output;
@hits = sort compare @hits;
#@hits = sort @hits;

#foreach (@probs)
#{
#   print "$_\n";
#}

print "SORTED\n";
foreach $hit (@hits)
{
    my (@aalist);

#   $i = index $hit, "*";
#   if ($i != -1)
#       $h = substr $hit, 0, (index $hit, "*");

    # convert the 0-based column numbers to 1-based ones for output
    $s = $output{$hit}[0];
    @aalist = split /\|/, $s;
    foreach (@aalist)
    {
        $aa = substr $_, 0, 1;
        $col_num = substr $_, 1;
        $_ = $aa . ($col_num + 1);
    }

    # the join separator is lost in the listing; "|" is an assumption
    $s = join "|", sort comp_aa @aalist;

#   print "\nEntry: ", $output{$hit}[0], "\n";

    $observed = $output{$hit}[1];
    $expected = $output{$hit}[2];
    $prob = $output{$hit}[3];

    if ($expected < $observed && $prob < 0.5)
    {
        print "\nEntry: ", $s, "\n";
        print "Obsrv hits: ", $output{$hit}[1], "\n";
        printf "Expct hits: %.9f\n", $output{$hit}[2];
        printf "Prob corrl: %.9f\n", $output{$hit}[3];
    }
}

APPENDIX B



HIV input:

TRPNNOTRKSVRIGPGQAFYATGDIIGDIRQAH

TRPNNYTRKMIPTGPGQVIYATGKIIGDIRKAY 

SRPNNNTRKSVHMGPGRAFYATGDIIGDIRQAY 

IRPGNNTRKSMHIGPGRPFYARG-VIGDIRQAH 

IRPNNNTRKSIHIGPGQAFYATGDIIGNIRQAH 

IRPNNNTRTSVHMGPGKTFYATGDIIGDIRQAH 

TRPNNNTRRSMRIGPGQTFYATGDI XGDXRQAY 

TRPNNNTRKSIRIGPGQAFYATGDIIGDIRQAH 

TRPSNNKRTSIHIAPGRAFYATGAI IGDIRQVH 

IRPNNNTRRSVRIGPGQAFYATGDI IGDIRQAH 

TRHNNNTRKS I RIG PGQAF YATGDI IGDIRQAH 

TRPSNNTRKSIRIGPGQAFYATGDI IGDIRQAH 

TRPNNNTRRSIHIGSGRAPY 1 IGDIRQAH 

IRPSRTTRKRWHIGSGQAFYAIDGITGDIRKAY 

TRPNNNTRRHMHIGPGRAFIATOAIVGDIRQAY 

TRPSNNTRKSVPIGPGQAFYATDDI IGDIRQAH 

TRPSNNTSKSIRIGPGQTFYATGRI IGDIRQAH 

IRPSNNTRKSVNIGPGQAFYATGDI IGDIRQAH 
TRPGNNTRKSVRIGPGQAFYATGD I IGDIRQAH 
TRPGNNTRKSWHIGPGRAFYTTDGI IGDIRKAY 

I RPGNNTRKGVH IG PGQAFYARGDI IGDIRQAH 
TRPGNNTRKSLRIGPGQTF YATGDI IGDIRQAH 
TRPNNNTRKSVR IGPGQAFYATGDI IGDIRQAH 
I RPNNNTRKSVH IGPGQAFYATGDI IGDIRQAY 
TRPNKNTRKSVRIG PGQTFYATGDI IGDIRQAH 
TR PGNYTRKSVRTG PGQTF YATGKI IGDIRQAH 
TRPNNNTRKGIHIGPGSAIYATGDI IGDIRQAH 
TRPNNNTRTGIHIGPGQTFYATGEIIGNIRQAH 
TRPNNNTRRSVRIGPGQTFYATGAI IGDIRQAH 
IRPNNNTRKSVRIGPGQTFYAAGDI IGDIRQAH 
TR PGNNTRRSVRIG PGQAFYATGEI IGDIRKAH 
TRLSNNTRKSVRIG PGQTFYATGEI IGDIRRAH 
TRPNNNTRKSVRIGPGQTFYATGDI IGDIRQAH 
TRPNNNTRTSVRIG PGQ AFYATGD I IGDTRQAH 
TR PGNNTRRSVR IG PGQAI YATGDI IGDIRKAH 
SRPNNNTRRS I H FG PGQTLYATGNI IGDIRQAH 
TRPNNNTRRS I RIGSGQTS YATGDI IGNIREAH 
SRPGNNTRKSVRIGPGQTF YATGDI IGDIRQAH 
TRPNNNTRKSVRIGPGQTFYATGDI IGDIRQAH 
TRPNNNTRKSVRIGPGQTFYATGDI IGDIRKAH 
TRP SNNTRKG I H IG PGRAFY ATGQ I TGD I RQ AH 
TRPGNNTNKNVH I G PGQ AF YARGR 1 1 GD I RKAH 
TRPNNNTRMSIRIGPGQAFYATGDI IGNIRQAH 
TRPNNNTRKS IH IG PGQ AFYATGD I IGN IRQAH 

TRPNNNTRTGI H IG PGQAF YARG A ITGDIRKAY 

TRPXNNTRKS I HIGPGQAFYATGD I IGDIRKAH 

TRPNNNTRTSIRIGPGQTFYATGDI IGNIRQAH 

TRPGNNTRTS IRIGPGQAFYGRGNI IGDIRKAH 

TR PNNNTRRS I RIGPGQAF YATGDITGD IRQAH 

ARPNNNTRRSIHIGPGQAFYA-SDI IGDIRQAH 

TRPNNNTRKSVHIG PGQ AFYATGD I IGDIRQAH 

TRPNNNTRKS IRIGPGQAFYTTGDI IGDIRQAH 
IRPNNNTRTS IRIGPGQAFYATGDI IGDIRQAH 

TR PNNNTRKSVP IG PGQAFY ATDN I IGDIRQAH 
TRPNNNTRTS I C IGPGQTFYA-GG I IGDIRQAH 
TRPrJNNTRKSVHIGPGQAFYVrGDIIGNIRQAH 
TRPNNNTRXS IH IGPGQAFYATGDI IGDIRQAH 
THPSNNTRTS IRIGPGQAFYATGDI IGDIRQAH 
TRPNNNTRKS ANIGPGQAFYATGEI IGDIRQAH 
IRPNNNTLKG IH IGPGQS FYATGS IVGNIRQ AH 
IRPYNNTRKSIHIGPGQAFYA-SRIIGNIRQAH 
TRPNNNTRKS I RIGPGQTFYA-GEI IGNIRQAH 
TRPNNNTRKGVH IGPGQAFYATGDI IGDIRQAH 
TRPNNNTRKSVRIGPGQAFYATGDI IGDIRQAY 
TRPNNNTRTS I RIG PGQSFHATGDI IGDIRQAH 
S RPNNNTRKSVH IGPGQAFYATGDVIGDIRQAY
IRPNNNTRKSVPIGPGRAFYATGDIIGNIRQAH

TRPNNNTRKGVR IGPGQAFYATGG I IGDIRQAH 

TRPNNNTRKSVRIGPGQAFYATGDIIGDIRQAH 

TRPNNNTRTSVR IGPGQTFYATGDI IGDIRRAY 

VR PNNNTRTSVR I G PGQTFYATGEI IGDIRRAF 

TRPNNNTRRS IRIGPGQAFYATGDI IGDIRKAH 

IRPNNNTRKSVH IGPGQAFYATGDI IGDIRQAH 

IRPNNNTRKSVH IGPGQTSYATGD I IGDIRQAH 

TRPNNNTRKSVH IGPGQAFYATGDI IGDIRQAH 

TRPNNNTRRSVH IGPGQAFYATGDI IGDIRRAH 

TRPNNNTRKS I HIX3PGRAFYATGDI IGDIRQAH 

SRPYN-TRKNYS IGSGQAFYVTGKI IGDIRQAH 

TRPYKKVRRRIHIGPGRSFY-T-SNLGDIRQAY 

TRPNNNI SRRIH IGRGQAFYATGGMTGNIRQAY 

IRPNNNTRKSVR IGPGQAFYATGDI IGNIRQAH 

TRPNNNTRRSVR IGPGQTFYATGDI IGDIRQAH 

TRPNNNTRTSVH I GPGQAFYARGD I IGDIRQAH 

TRPNNNTRKS I H IG PGQAFY ARGDI IGNIRQAH 
TRPNNNTRKSVH IGPGQAFYATG EI IGDIRQAH 
TRPNNNTRKSVR IGPGQTFYATGDI IGNIRQAH 
TRPNNNTRKGVH IGPGQAFYATGDI IGNIRRAH 
TRPNNNTRQSVH IG PGKAFYATGGIVGDIRQAY 
TRPNNNTRKSVHIGPGQAFYATGAI IGS IRQ AH 
TRPNNNTRRSVH IGPGQAFYATGDI IGDIRQAH 
TRPGNNTRRSVR IGPGQTFYATGDI IGDIRQAH 
IRPNNNTRTSVRIGPGQAFYATGDIIGDIRKAY 
TRPNNNTRKS IGICPGQTFYAADNI IGDIRQAH 
TRPGNNTRTSVR IG PGQAFYATGDI IGDIRQAH 
TR PNNNTRTSVR I G PGQSF YATGD I IGD I KQAH 
MRPNNNTRKS I SIGPGRAFFATGDI IGDIRQAH 
TRPSNNRRQSVR IGPGQAFYATGDI IGDIRRAH 
TRPNNNTSQGVH IG PGQVFYARDRI IGD I RKAY 
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAY 
IRPNNNTRRGIHMGPGQILYATGSI IGDIRQAH 
TRPNNNTRKS I RIGPGQVFYTN-DI IGDIRQAH 
TRPNNNTRKSVH I G PGQAFYATGDI IGNIRQAH 
TRPNNNTRKS IRIGPGQAFYATGDI IGNIRQAH 
TRPNNNTRKS I RIG PGQVFYATG ********** 
TRPNNNTRKSVR IGPGQTFYATGDI IGD IRQ AH 
TRPNNNTRTSVRIGPGQAFYATCDI IGDIRRAH 
TRPNNNTRKS I H IGPGRAFYTTGEI IGDIRQAH 
TR PNNS KRKT LHMG P KRAF YATGD IGGY I RQAH 
TRPNNNTRKSIQIGPGRAFYTTGEI IGDIRQAH 

TRPNNNTRKG I HMGPGSTFYATGEI IGDIRQAH 

TRPSNNTRKGIKLGFGRALYATGEITGDIRQAH 

TRPNNNTRKSLSIjGPGRAFYTTGDIVGDIRQAH 

TRP SNNTRKG I H I G PGRTFF ATGEI IGDIRQAH 

TRPNNNTSKG I HMG PGGAFYTTGR I IGD IRRAY 

TRPNNNTRKS IS IGPGRAFYATGDI IGDIRQAH 

TRPNNNTRKGIHMGWGRTFYATGEIIGAIRQPH 

TRPNNNTRKS I HMGWGRAFYATGDI IGDIRQAH 

TRPNNNTRKS IHVGWGRSLFTTGEI IGNIRLAH 

TRPNNNTRKS Z HMGWGRAFY ATGEI IGDIREAH 

TRPNNNTRKR I Y I G PGRAVYTTGQI IGDI RRAH 
ERPNNNTRKS INIGPGRAFYTTGDI IGDIRQAH 

TRPSNNTRKSIHLGLiGRAFYTTGDI IGDIRQAH 
TRPHNNTRRS IT IGPGRAFYTTGDI IGDIRQAH 
TRPSNNTRKSIHLGWGRAFYATGEIIGDIRQAH 
TRLNNNTRTS I H I G PGQAFYATGDI IGDIRQAH 
TRPNNNTRKS IH I GPGSAFYATGDI IGDIRQAH 
TRPNNNTRKS I HMGWGRT F Y ATGE 1 1 GD I RQAH 
TRPNNNTRKG I H IGPGRAFYAT- EITGDI RQAH 
LRPSNNTRKSIHMGWGRAFYATGEI IGDIRQAH 
TRPNNNTRKS I HMGWGRAFYATGEI IGNIRQAH 
TRPGNNTRKGI P IGPGGSFYATERI IGDI RQAH 
I RPNNNTRRS 1 1 IGPGRAFYATGDI IGDI RQ AY
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH

TR PNNNTXKS I H IGPGS AFYATGD I IGD I RQAH 

TR PGNNTRRS I HMG WGRAF YATGD 1 1 GDI RQAH 

TR PNNNTRKS I H I G PGRAFYATGD I IGD I RQAH 

TRPNNNTRKS I HMGWGRAF Y ATG E I IGN I RQAH 

TRPNNNTRKS I HIGPGKAFYATGEI IGNIRQA Y 

T R PNNNTRKS I H LGWGRAFYATG EI VGD I RQAH 

TRPNNNTRKS IT IGPGRAFYATGEI IGD IRQAH 

TR PNNNTRKS I HMGWGRT F YATGE 1 1 GDI RQAH 

TRPSNNTRKG IHIG PGRAFYATGDI IGDI RQAH 

TR PSNNTRKS I H I GWGRA I YATGAI I GD I RQAH 

TR PNNNTRKS I HVGWGRALYTTGEI IGN I RQAH 

TRPNNNTRKS I QYGTGGAFYATGEIVGD I RQAH 

TRPGNNTRKSIHIGPGRAFYTTGDI IGDI RQAH 

TRPNNQTRKS IHMGWGRAFHTNGEI IGN I RQAH 

TRPNNNTRKG I HMG LGRAFYATGG IVGD I RQAH 
TRPSNNTRKGIHIGWGRAFYATGEITGDIRKAY 
S R PNNNTRKS I HMGWGRAFYTTGE I IGDI RQAH 
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH 
TR PGNNTRKS I HLGWGRAFY ATG A 1 1 GD I RQAH 
TRPSNNTRKSIHLGWGRAFYATGEIVGDIREAH 
TRPSNNTRRSIHLGPGGAFYTTGEIIGNIRKAF 
TRPNNNTRKS I RI G PGSAFYATGDI IGDI RQAH 
TRPNNNTRKS I PIAPGSAWFATGEI IGDIRQAH 
TR PNNNTRKS I H LGWGRAF YTTGQI IGE I RQAH 
TRPNNNTRKS I HVGVGRAI YATGEI IGDIRQAH 
TRPSNNTRKSIHMGWGRAFYATGEI IGDIRRAH 
TRPNNNTRKS I HMGWGRAF YTTGDI IGDIRQAH 
TRPNNNTRKRKS IG PGRAFYTTGEVIGDIRQAH 
TRPNNNTRKS IHMGPGSAIYATGEI IGDIRKAY 
TRPNNNTRKGIHIGPGRAFYTT-DIIGDIRQAH 
TRPNNYTSKRIRIGARRAFYTKGKIIGDIRQAH 
TRPNNNTRKG IHIG PGRAVYTTGRI VGD I RLAH 
TRPNNNTRKS IQRG PGRAFVTIGKI -GNMRQAH 
TRPNNNTRNRISIGPGRAFHTTKQIIGDIRQAH 
TRPNNNTRKS ITKGPGRVI YATGQI IGDIRKAH 
TRPYNNVRRSLS IGPGRAF * RTREIIGIIRQAH 
TRPNNNTRKS INIGPGRAWYAT-NI IGDIRQAH 
I RPNNNTRKS I PIG PGRAFYATGDI IGDIRQAH 
TRPNNNTRKS IH I GPGRAFYT-GEI IGDIRQAH 
TRPNNNTS KRI S IG PGRAFRAT- KI IGNI RQAH 
TRPNNSTRKRISIGPGRVWYTTGQIIGDIRKAH 
TRPNNNTRKRISIGPGRVWYTTGQIIGNIRKAH 
TRPNNNTRRSGHIGGGRTLFTT-HIVGDIRKAH 
TRPNNNTRKS IHIGPGRAFYT-GEI IGDIRQAH 

TRPNNNTSKRISIGPGRAFRAT-KIIGNIRQAH 

TRPNNNTRKR I S IG PGRAS YTTGQI IGDIRKAH 

TR PNNNTRKRI S IG PGRAWYTTGQ I IGDIRKAH 

TRPNNNTRRSGH IGGGRTLFTT- HIVGD I RKAH 

TRPSNNTRKS I PMGPGKAF YTTGDI IGD IRQ AY 

TRPNNNTRKS I H IG PGRTFFTTGDI IGDI RQAH 

TRPNNNTRKS IN IGPGRAFYATGEI IGN IREAH 

ERPNNNTKRSITIGPGRAFDAYGGI IGDIRQAH 

TRPNNNTRKS IHMGPGKAFYTTGEIVGDIRQAH 

TRPNNNTRKG I H IGPGGAFYATGGI IGDIRQAH 

TRLNNNTRKS INIGPGRAFYATRDI IGDIRQAH 

TRPNNNTRKS IHIG PGRS F YTTGDI IGDI RQAH 
TRPNNNTRKS IH I GPGRAFYTTGD 1 1 GDI RQAH 
TRPNDNTRKSI PMGPGKAFYATGDIIGNIRQAH 
TRPNNNTRKS I H IGPGRAF YTTGS I IGDIRQAH 
TR PNNNTRKG I T I G PGRAF Y ATEK I IGDIRRAY 
I RPNNNTRKS I P IGPGRAF YATGDI IGDIRKAH 
TRPNNNTRKS I P IGPGRAF YATGDI IGDIRQAY 
TRPNDNTRKS I H I GPGRAF YTTGQI IGN I RQAH 
TRPNNNTRKSIHMG PGSAFYATGDI IGN I RQAH 
TRPNNNTRKS I P IGPGRAFFTTGDI IGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPSNNTRKG I HIGPGGAFYTTGEI IGDI RQAH 
TRPSNNTRKSIH IG PGRA F YAT- D I IGDI RQAH 
TRPKNE I KRRI KIG PGRAFVATGT- VGDTRQAQ 
TRPNNS I KRRIHIGPGRAFFATNT- VGDTRQAQ 
TR PDNE I RRS LQVG PGRAFVAAGT - AGDTRQAQ 
TRPGNNTRRSIHIGPGRAFFATGDITGDIRQAH 
TRPNNNTRKS ITIGSGRAFHAI EKI IGNIRQAH 
TRPSKTTRRR I H IGPGRAFYTTKQ I AGDLRQAH 
TRPNNNTRKSIRIGPGRAFVTIG-KIGNMRQAH 
TRPNNNTRKS I H IGPGKA FYATGEI IGDI RQAH 
TRPNNNTRKSIHIGPGSAFYTTGDIIGDIRQAH 
TRPNNNTRKRVTMG PGRVWYTTG E I IGNIRQAH 
TRPNNNTRKGIHLGPGGTFYATGEI IGDIRQAH 
IRPNNNTRKSINIGPGRAFYTTGEI IGDI RQAH 
TRPNNNTRRGIHIGLGRRFYT-RKI IGDIRQAH 
TRPHNNTRXSIHIGPGRAFYTTGEIIGDIRQAH 
TRPGNNTRRS I PIGPGKAFFTT— EI IGDIRQAH 
TR PNNNTRKS I H IG LGRAFYTTGD I IGDI RQAH 
TRPNNNTRKS I PIG PGRA FYATGEI IGDIRQAH 
TRPNNNTRKS I P I G PGRAF YTTG E I IGD I RQAH 
TRPNNNTRKS I HIGPGRAFYTTGEI IGNIRQAH 
TRPNNNTRRS IGIGPGRAI YATDRIVGNIRQAH 
IRPNNNTRKS I SIGPGRAFYATGEI IGNIRQAH 
TR PNNNTRKG I HIGPGRAF YAT ERI IGNIRQAH 
TRPNNNTRRGI H IGPGRAVYTTGKI IGDIRQAH 
TRPSNNTRRS IHIG PGRAFYTTGQ ITGNIRQAH 
TRPNNNTRKS IQ IGPGRAFYTTGEI IGNI RQAH 
TRPNNNTRKS IHIG PGRAF YTTGDI IGDIRQAH 
TRPNNNTRKS IH IGPGRAFYTTGEI IGDIRQAH 
TR PNNNTRKG I HIGPGRAFYTTGEI IGDIRQAH 
TRPNNNTRKS IHIG PGRAFYATGE I IGDIRQAH 
TRPNNNTRKRNTLGPGKVFYTTGEI IGDIRQAH 
IRPNNNTRKS IHIG PGRAFYTTGEI IGDIRQAH 
TRPNNNTRKS IHIG PGRAFYTTG E I IGDIRQAH 
TR PNNNTRKS I HIGPGRAFYATGEV I GDI RQAH 
TRPNNNTRKG IH IGPGRAFYTTGDI IGDI RQAH 
TRPNNNTRKS IH IGPGRAFYTTGEI IGDI RQAH 
IRPNNNTRKS IH IGPGRAFYTTGEI IGDIRQAH 
TRPNNNTRKS I PIGPGRAF YTTGDI IGNIRQAH 
IRPNNNTRKS I PIGIXJSAFYTT- EI IGDIRQAH 
TRPNNNTRKS I HMGPGKTFYTTGDI IGDIRQAH 
TRPNNNTRKS IHIGPGRAFYTTGQI IGD IRQ AY 
TRPNNNTRKS I P IG PGRAFYTTGEI IGDISQAH 
TRPNNNTRKS IH IGPGRAFYATGDI IGDIRQAH 
TRPNNNTRKS I H IGPGRAFYATCEI IGDIRQAH 
I RPGNNTRKS I P TG PGRAF YATGD I IGD I RQAH 
TR PNNNTRKG IRIGPGRAF I AATKI IGDIRQAH 
TRPNNNTRKS I P IGPGRAFYTTGDI IGDIRQAH 
TRPNNNTRKS IHIGPGKAFYATGE I IGDIRQAH 
TRPNNNTRKG IHIGPGRAFYATEAI IGDIRKAY 
TRPNNNTRKG I HIGPGKAFYTTGE I IGD I RQAH 
TRPNNNTRKS I HIG PGRAFYTTGEI IGDIRQAH 
TRPNNNTRKS INIGPGRAFYTTGGL.IGDIRQAK 
TRPNNNTRKS IH IGPGRAFYTTGEI IGDIRQAH 
TRPNNNTRKS I HIG PGRAFYTTGEI IGDIRQAH 
TRPNNNTRKS IH IGPGRAFYTTGEI IGDIRQAH 
TR PNNNTRKS IHIG PGGAFYATG E I IGDIRQAH 
TRPNNNTRRGIHIGPGRAFYTTGQI IGNIRQAH 
TRPNNNTRKG I H IGPGRAFYATGDI IGDIRQAH 
IS PNNNTRKS I HIGPGRAFYTTGEI IGDI RQAH 
TRPNNNTRKS I HIG PGRAF YTTGDI IGDIRQAH 
TRPNNNTRKSIHLGPGKAVYTTGEIIGDIRQAH 
TRPNNNTRKS I PIGPGRAF YTTG EI IGDIRQAH 
TR PNNNTRKS IHIG PG RAF Y ATG E I IGD I RQAH 
TRPNNNTRKS I HIGPGRAFYTTGEI IGNIRQAH 



TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH

TRPNNOTRKSIHIGPGRAFYTTGDIIGDIRQAH

TRPNNNTRKS I NIG PGRAFYATG E I IGDIRQAH 

TR PNNNTRKS I H I G PGRA F YATG E I IGD I RQ AH 

TRPNNNTRRS I PIG PGRAFYATGN I IGDIRQAH 

TRPNNNTRKS I NIG PGRAF YTTG EI IGD I SQAH 

TR P FNNTR KS I PIG PGRA FYTTGDI IGDIRQAH 

TRPNNNTRRS I HIGPGRAFVTTGG I IGDIRQAH 

TRPNNNTRKS I H IGPGRA FYTTGDI IGDIRQAH 

TRPNNNTRIG I H IG PGRAFYATGEI IGDIRQAH 

TRPNNNTRKS I NTG PGRA FYTTGDI IGDIRQAH 

TRPSNNTRKGIQIGPGRAFYTTGQITGDIRQAH 

TRPNNNTRKGIHIGPGRAFYATGEIIGNIRQAH 

TRPNNNTRKS I TIGPGRAFYTTGE I IGDIRQAH 

TRPNNNTRKSIHIGPGRAFYTTGEIXGDIRQAK 

TRPNNNTRiCSIHIGPGRAFYTTGEIIGDIRQAH 

TRPNNNTRKS I HIGPGRAFYATGEI IGDIRQAH 

TRPNNNTRKS I HIGPGRAFYTTG EI IGNIRQAH 
TRPNNNTRRG I H IG PGRAVYTTGE I IGNXRQAH 
TRPNNNTRKS IHIGPGRAFYATGDI IGDIRQAH 
TRPNNNTRKS I NIGPGRAFFTTGKI IGDIRQAH 
TRPSNNTRKXI HIGPGRAFYATGEI IGDIRQAH 
TRPNNNTSKGI H IGPGRAFYTTGDI IGDIRQAH 
TRPNNNTRKGIHIGPGRAFYATGEI IGDIRQAH 
TRPGNNTSRG I H IGPGRA FYTTXKI IGDIRQAH 
TRPNNNTRKS I N IG PGRAF YTTGD 1 1 GD I RQ AH 
TRPNNNTRKSIPMG PGRA FYTTGDI IGNIRQAH 
TRPHNNTRKSIPIGPGRAFYTTGEIIGDIRQAH 
TRPNNNTRKGIHIGPGRAFYTTGEI IGNIRQAH 
TRPNNNTRKS I H I A PGRAFYATG E I IGD IRQ AH 
TRPNNNTRKS IXIGPGRAFYATGEIIGDIRQAH 
TRPNNNTRKS I N IG PGRAF YTTG E I IGDIRQAH 
TRPNNNTRKS I PIGPGRAFYTTGQI IGDIRQAH 
TRPNNNTRKGIHIGPGKAFYATGEIIGNIRQAY 
TRPNNNTRKGIHIGPGSAFYATGEI IGDIRQAH 
TRPNNNTRKS IHIGPGRAFYTTGEI IGDIRQAH 
TRPNNNTRKSIHIGPGRAFYTTGDIVGDIRQAY 
TRPNNNTRKS I H IGPGRAFYATGEI IGD IRQAH 
TRPNNNTRKS I H IGPGRAFYTTGDI IGDIRQAH 
TRPNNNTRKSIHIGPGRAFYATGQI IGDIRQAH 
TRPNNNTRKGIHIGPGRAFYATGDI IGDIRQAH 
TRPNNNTIKSIHIGPGRAFYTTGQI IGDIRQAH 
TRPNNNTRKG I H IGPGRAF YTTG? I IGDIRQAH 
TRPNNNTRKS IT IGPGRAFYTTGDI IGDIRQAH 
TRPNNNTRRS I NIG PGRAFYATG EI IGDIRQAH 
TRPNNNTRKS IHI APGRAFYATGE I IGD IRQAY 
TRPNNNTRKS I H I G PGRAFYATG A I IGNIRQAH 
TRPNNNTRKS I HLGPGQAWYATGEI IGDIRQAH 
TRPNNNTRKS IHLGQGQAWYATGEI IGDIRQAH 
TRPNNNTRKS I HLG PGQAWYTTGQI I GDI RQAH 
TRPNNNTRKS I P LG PGRAWY ATG E I IGDIRQAH 
TRPNNNTRKS I PLG PGQAWYTTGQI IGDIRQAH 
TRPNNNTRKG I HLG PGQAWYTTGQI I GDI RQAH 

TRPNNNTRKS I PLGPGQAWYTTGQ I IGDIRQAH 

TRPNNNTRKS I PLC PGQVWFTTGQ I IGDI RQAH 

TRPNNNTRKS IHLGPGQAWYTTGQ I IGDIRQAH 

TRPNNYTRKXIXMGPGRXXYTTGEIIGDIRRAH 

TRPNNNTRKS I HLG PGRAWYTTGQ I IGDIRQAH 

TRPNNNTRKS I HLGPGRAWYTTGQI IGDIRQAH 

TRPNNNTRKS I PLG PGQAWYTTGQI IGDIRQAH 

TRPNNNTRKG I P IG PGRAF YTTGDI IGD I RQAH 

TRPNNNTSKG I P IGPGRAFYATGXI IGDIRQAH 

TRPNNNTRKG IHIGPGRAFYTTGEI IGDIRQAH 

TRPNNNTRKG I H IGPGRAFYTTGEI IGDIRQAH 

TRPNNNTRKGI H IGPGRAFYTTGEI IGDIRQAH 

TRPNNNTRKG I H IGPGRAFYTTGEI IGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGQIIGDIRQAH

TRPNNNTRKGI H IG PGRAF YTTG EIVGDI RQAH 

TRPNNNTRKG I HIG PGRAF YTTGG I IGDI RQAH 

TRPNNNTRKS I HMGQGRAFYATGG I IGDIRQAY 

TRPNN^TTRKGIHI^PGQAWYTTGQIIGDIRQAH 

TRPNNNTRKG 7 PLGPGQAWYTTGQ I IGOIRQAQ 

TRLNNNTRKSI AIG PGRTVYATDR I IGDI RQAH 

TRPSKNIRRSIHIGSGRAFYTIEGVAGDVRKAY 

TRPNNNTRRGIHIGPGRAFYATGNI I GDI RQAH 

TR PSNNTRKS I H IG PGRVFHATG E I IGDI RQAH 

TR PNNNTRKR I YIG PGRAVYTTEQ I IGNI RQAH 

TRPGMNTRERI S IGPGRAFI ARGQ I IGDI RQAH 

TRPGNNTRKS I P IG PGRAF I ATSQ I IGDI RKAH 

IRPNNNTRKGIGIGPGRTVYTAEKIIGDIRQAH 

TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH 

TRPNI YRKGRIHIGPGRAFHTTRQI I ENIRQAH 

TRPNNNTRKSIHIGPGRAFYTTGEI IGDIRQAH 

TRPNKKTRKRITTGPGRVYYTTGEIVGDIRQAH 

TRPNNNTRKRITMGPGRVYYTTGQI IGDIRRAH 

IRPNNNTRKGINVGPGRALYTTGDI IGDIRQAH 
TR PNNHTRKR VTLG PGR VWYTTG E I LGN I RQAH 
TRPNNNTRKSITLGPGRAFYTTGDIIGDIRQAH 
TRPNNNTRKSIHIAPGRAFYTTGDIIGDIRKAH 
TRPSNNTRKS IHIGPGRAFYTTGEI IGDIRQAH 
TRPGNNTRKS I PMG PGRA FYATGDI IGDI RKAH 
TRPNYNKRKRIHIGPGRAFYTTKNI I GT I RQAH 
TRPNNNTRKG I A IGPGRTLYAREK I IGDIRQAH 
TRPNNNTRRRLS IGPGRAFYARRNI IGDIRQAH 
TRPNTKKI RH I H I G PGRAF YATGG IMGDI RQAH 
TRPNNNTRRS INI G PGRAF YTTGDI IGDIRQAH 
TRPNNNTSKRISIG PGRAFVAAREI IGDI RKAH 
I RPNNNTRKS I S IGPGRAFYTTGEI IGDIRQAH 
TRPNNNTTRS I H IG PGRA FY ATGDI I GD I RQAH 
TRPNNNTRKSITIGPGRAFYATGDIIGDIRQAH 
TRPNNNTRKS I YIGPGRAFHTTGR I XGDI RKAH 
TRPNNNRRRRITSGPGKVLYTTGEIIGDIRKAY 
I RPNNNTRKG I H IG PGKAF YTTGE 1 1 GN I RQAH 
TRPNNNTRKSINIGPGRALYTTGEI IGDIRQAH 
TR PNNNTRKG I HIG PGRAF YATGE I IGDIRQAH 
TRPNNNTRRS I PMG PGKAF YTT -EI IGN I RQAH 
TRPSNYTGKRLSIGPGRAFVATRKI IGDIRQAH 
TRPGNNTRKS I TMGPGKVFYA-GE I IGDIRQAH 
TRPNNNTRKS I PMG PGRAFYTTGE I IGD I RKAY 
VRPSNNTRQS I PIG PGKAF YATGEI IGDI RKAH 
TRPNNNTRRSVH IG PGSAL YTT - DI IGDI RQAH 
IRPNNNTRRS INMGPGRAFYTTGDI IGDIRQAH 
TRPNNNTRRS I HIG PGRAWYTTGKITGDI RQAH 
TRPNNNTRKRITMGPGRVLYTTGQI IGDVRRAK 
TRPNNNTRKS IHIAPGRAFYATGEI IGDIRQAH 
TRPNNNTRKG 1 H IG PGRAF YATGD I IGDIRQAY 
TRPSNNTRKG I P IG PGRAFYTTGG I IGD I RQ AH 
TR PNNNTRKS I H I A PGRA FY ATGG 1 1 G D I RQAH 
TRPNNNTRRSINMGPGRAFYTTGDI IGDIRQAH 
TRPSNNTRKS I TIGPGRAFYTTGEVIGD I RQAH 

TRPNNNTRRGIKIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIP^PGRAFYATGDIIGDIRQAH
TRPNNNTRKSIhxGPGKAFDAT-DIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGEIIGDIRKAH
TRPNNNTRKGIHMGPGRAFYTTGAIIGDIREAH
TRPNNNTRRSITIGPGRAFYAT-DIIGDIRQAH
TRLSNKTRRSIHIGPGRAFYAT-DIIGDIRQAH
TRPNNNTRRSIHIAPGRAFYATGDIIGDIRQAY
TRPNNNTSRRISIGPGRAFTAREGIIGDIRQAH
TRPNNNTRRSIHIGPGKAFYATGGIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH

TRPNNNTRKSIHIGPGSAFYTTGDIIGHIRQAH
TRPNNNTGKSIHLAPGRGFHATGEIIGNIRQAH
TRPNNNTRKGIAIGPGRTVYATGRIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYATGGIIGEIRQAH
TRPNNNTRKGIPIGPGRAFYTTGDIIGDIRQAH
TRPNNNTRKSIHIAPGRAFYATGEIIGDIRQAH
SRPNNNTRKGIHIGPGRAFYATGDIIGDIRQAH
TRPGNNTRRSIHIGPGRAFYTTGEIIGNIRLAH
TRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGDIIGDIRQAH
TRrNNNTRKGIHIGPGRAFYATGEIIGNIRQAH
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGEVIGNIRQAH
TRPNNNTRKSIPMGPGKAMYATGEIIGDIRKAY
TRPNNNTRKSIHIGPGRAFYTTGEIVGDIRQAH
TRPNNNTRKSIHIGPGRAFYAT-DIIGDIRQAH
TRPNNNTRKSIPMGPGRAFYTTGEVIGNIRQAY
TRPNNNTRKSIHIGPGRAFHTTGEVIGDIRQAH
TRPNNNTRKSINIGPGRAFYATGEIIGDIRQAH
TRPNNNTRKSINIGPGRAFYTTGEIIGDIRQAH
IRPNNNTRRSIHMGPGRAFYATGDIIGDIRQAH
IRPNNNTRRSINIGPGRAFYTTGDIIGNIRQAH
TRPGNKTIRSISMGPGRAF-RTGQIIGNIRQAN
TRPNNNTRKSIPIGPGRAFYATGDIIGDIRQAH
TRPNNNTRRSIHIAPGRAFHATGNIIGDIRQAH
TRPSNNTRKSVHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHLGPGRAFYATGEIIGDIRQAH
IRPNNNTRKSIHIGPGRAFYTTGDIIGDIRKAH
TRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAH
TRPNNNTRKSIHIGPGRAFYTTGQIIGDIRQAH
TRPNNNTRKSIPIGPGRAFYTTGDIIGDIRKAH
TRPSNNTRRSIHMGLGRAFYTTGDIIGDIRQAH
TRPNNNTRKGIHIGPGRAFYTTGQIIGDIRKAH
TRPNNNTRRSIPIGPGRAFYTTGQIIGDIRQAH
IRPNNNTRKSITMGPGKVFYVT-DIIGDIRQAQ
TRPSNNTRKRIAIGPGRAVYTTEQIIGDIRRAH
ERPNNNTRKSINIGPGRAFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQAFYATGEIIGDIRQAH
TRPNNNTRKSISLGPGQAFYATGDIIGNIRQAH
TRPNNNTRESIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRQSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAY
TRPNNNTRKGVRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRRAY
TRPSNNTRKSIRIGPGQTFYATGEIIGDIRQAH
TRPNNNTRKSLRIGPGQTFYATGDIIGDIRRAH
TRPNNNTRKSTRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRRAY
TRPNNNTRKSIRIGPGQAFYATGDIIGDIRQAY
TRPNNNTRKSIRIGPGQAFYATNDIIGNIRQAH
TRPNNNTRQSIRIGPGQVFYATKDIIGDIRQAH
TRPTNNT^OSIRIGPGQAFrATKGIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSIRIGPGQAFYATGGIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAY
TRPNNNTRKSVRIGPGQTFYATGDIIGNIRQAH
TRPGNNTRKSMRIGPGQPFYATGDIIGNIRQAH
TRPNNNTRKSIRIGPGQAFYATNDIIGDIRQAH
TRPNNNTRKSMRIGPGQTFYATGDIIGNIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
VRPNNNTRKSIRIGPGQTFYATN*****
TRPNNNTRQSVRIGPGQAFYATKDIIGDIRQAH
TRPGNNTRKSIRIGPGCTFYATGDIIGDIRQAH

TRPNNNTRRSIRIGPGQVFYANNDIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATNEIIGNIREAH
ARPNNNTRKSMRIGPGQTFYATGDIIGDIRQAH
TRPNNNTRKSVRIGPGQTFYATGDIIGDIRQAH
TRYANNTRKSVRIGPGQTFY-TNDIIGDIRQAH
ARPNNNTRESIRIGPGGTFYATGOIIGDIRQAY
TRPNNNTRKRIRVGPGQTVYATNAIIGDIRQAH
TRPSNNTRKSIRIGPGQAFYATGGIIGNIRQAH
ARPGNNTRKSIRIGPGQTFFATGAIIGDIRQAH
TRPNNNTRKSIRIGPGQTFYATGDIIGNIRQAH
TRPYNKIRQRTHIGPGQALYTT-RIIGDIRQAH
TRPNNYKRQGTPIGLGQALYTT-RVIGDIRKAH
TRPNNNTRQGTHIGPGQALYTT-GVIGDIRKAH
TRPYNNTRQSTRIGPGQTLFTT-KIIGDIRQAH
TRPYNNTRQGTHIGPGRAYYTT-NIIGDIRQAH
TRPYNNTRQGTHIGPGQTLFTT-KIIGDIRQAH
TRPYNNKRQRTPIGLGQVLHTT-RVKGDIRQAH
TRPYSRVRQGAHIGPGRAYYAT-NIFGDIRQAR
TRPSNNTRQSTRIGPGQALYTN-KIIGNIRQAH
ARPYNNTRQSTRIGPGQALFTS-KIIGNIRQAH
TRPYENMRQRTPIGLGQALVTS-RIKGRIRPAY
TRPYNNTRQGTHIGPGRAYYTT-RILGNIRQAH
TRPYNNTICGTHIGPGRAYYTTISVIGDIRQAH
TRPYNNTIQKTSIGRGQAI*YTT-ETRGDIKQAF
TRPYNNIRQRTPIGSGQALYTT-RRIGDIRQAY
TRPYNNTRQGTHIGPGRAYYTT-RIVGNIRQAH
TRPYNNTRQSTHFGPGRAYYTT-DIIGDIRQAH
TRPNNNTRQSTQIGPGQALFTKTRIIGDIRQAH
TRPYENVRHRTPIGLGQALITN-RIKAKIGQAY
TRPYNQIRQRTSIGQGQALYTT-RVTGDIRKAY
TRPYNNTRKGIHIGPGRAYYTT-NIVGNIRQAH
TRPYDKVSYRTPIGVGRASYTT-RIKGDIRQAH
TRPYNNIRQRTPIGLGQALYTT-RRIEDIRRAH
IRPYNNTREGTHIGPGRALFTT-DIIGDIRQAH
ARPYAIERQRTPIGQGQVLYTT-KKIGRIGQAH
TRPNNNTRQSTHIGPGQAIYTLTKVVGDIRQAH
SRPYENKRRRTPIGLGQAYYTT-KLKGYIRPAH
TRPEKIKRRGTPIGLGQAYLTT-QITGYIRQAH
TRPYRNIRQRTHIGTGQAYYTK-GIKGVAGQPH

IRPNKTKIQRTSIGLGQALYTNDKIIGNIRQAY
ARPYIKIWRRTHIGSGQAYSTX-RIQNYTGPAH
TRPKNITIQRTPIGLGQALYTT-KRIGVIGQAS
SRPRNVTIQRTSIGSGQALYTT-KRIGYIKQAH
TRPYHNKIQRTHIGTGQALHTT-RITGYIGQAH
TRPYYNIRQRTPIGIiGQALYTTRGTTKVIGQAH
TRPYNKTSQRTSIGQGRALYTT-KPTGYIRQAY
SRPYKSTRIRTHIGSGQAYYRT-NIQGDIRQAY
TRPYRAMRRRTSIGQGQAYYTTTGIGGNIRQAY
TRPYSNKRQSTPIGbGQALYTT-RGRGDIRKAH
ARPYEKKRRTTPIGLGQALITS-RNFEKIGQAH
TRPYKS 1 RRICPGRWCTYY — TTNITGRAH
IRPNKRTRQRTHIGSGQALYTT-KIVGDIRQAH
TRPDHIKRQRTPIGQGQALYTTRLTTRRIGQPH
MRPYNNiCRQSVHIGPGRAFYTT-NIIGDIRQAH
TRPYNNTROGTHIGPGRAYWTT-NIIGDIRQAH
TRPYNNTRQGIHIGPGRAYYTD-QITGDIRQAH
TRPSNNTRKSIHIGPGQALFTI-DIIGNIRQAH
TRPNNNTRQSTHIGPGQALYTT-KIIGDIRRAH
TRPANNTRQSVHLGPGQALYTT-RVIGDIRQAY
TRPYNNIKIQTPIGRGQALFTT-RIKGIKGQAH
TRPNNNTRQSIHIGPGQALYTT-NVIGDIRQAH
TRPYTNKRQGTHMGPGRALYTI-DITGDIRQAY
TRPYNNTRQSTHIGPGQALYTT-NIIGDIRQAH
VRPYSNQRRRTPIGLGQALYTTMDNMKNIKQAY
TRPYNNIKIQTPIGRGQALFTT-RRKGIKGQAH
TRPYTKTRH RAQGRAWWTTGITGDIRQAY

ARPYENIRQRTPIGTGQALYTTKK-IGKIGQAH
TRPYSKERLKTSIGOGQALYTTVKVTGDIRQAH
ARPYQNTRQRTPIGLGQSLYTT-RSRSIIGOAH
TRPNKITRQSTPIGLGQALYTT-RIKGDIRQAY
TRPGNNTRRGIHFGPGQALYTT-GIVGDIRRAY
TRPYKYTRQRTSIGLRQSLYTIKKKTGYIGQAH
TRPYRNIRQRTSIGLGQALYTT-KTRSIIGQAY
IRPNNNTRQSTHLGPGQALYTT-KVTGDIRQAY
TRPNNNTRKSIHIGPGQAIYTT-DVIGDIROAY
TRPNNNTRKGIHIGPGQALYTSGDIVGDIRQAH
TRPNNNVRQRTPIGPGQAFYTTG**********

TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPFKNMRTSARIGPGQVFYKTGSITGDIRKAY
TRPFKKVRISARIGPGRVFHTTGNINGDIRKAY
TRPFKRVRTSVRIGPGRVFHKTGAINGDIRKAY
TRPSNNTRTSVRIGPGQVFYKTGDIIGDIRRAY
TRPFKKTRISARIGPGRVFHKTGAILGDIRKAF
TRPSNNTRTSVRIGPGOVFYKTGEIIGDIRKAF
TRPSNKIRTSVRIGPGQVFYKTGAIMGDIRKAF
TRPSNNIRTSVRIGPGQVFYKTGSITGDIRKAF
TRPFKKMRTSVRIGPGRVFYKTGSITGDIRKAY
TRPYKNTRTSARIGPGOVFYKTGSITGDIRKAY
TRPSNNTRTSVRIGPGQVFYGTGEIIGDIRRAF
TRPSTTIRTSSRIGPGOAFYKIEGISGNIRAAY
TRPSNNTRTRITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQIFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTSITIGPGQVFYRTGDITGNIRKAY
TRPSNNTRTSIPIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTSITMGPGQVFYRTGDIIGDIRRAY
TRPSNNTRPSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYKTGDIIGNIRKAY
TRPSNNTRTSIPIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSIPIGPGQAFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTSITIGFGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIXKAY
TRPSNNTXPSITXGPGQVFYRTGDIIGDIRXAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSINIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITVGPGQVFYRTGDITGDIRKAY
TRPSNNTRTSIPIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRQAY
TRPSNNTRTSINIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAY

TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPSNNTRTGITIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQIFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRKAH
TRPSNNTRTSLTIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSLTIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGDIRRAY
TRPSNNTRTSINIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVLYKTGDIIGDIRKAY
T3.PSNNTRTSTTIGPGQVFYRTGDITGNIRKAY
TRPSNNTRTSVRIGPGQVFYRTGDIIGDIRKAY
TRPSNNTRTSITIGPGQVFYRTGDIIGNIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIQLGPGRAFYTTGEIIGDIRKAH
TRPNNYTRKSIYFGPGRAFHTAGKIIGDIRKAH
TRPNNNTRKGIHIGPGRAFYATGDIIGDIRKAH
TRPNNNIRKSIPLGPGRAFYATGEIIGDIRKAH
TRPSKTIRRRIRIGLGRVFYAT-GVNGDIRKAY
TRPNNNTRKSIHIGPGRAFYATGDIIGDIRKAV

TRPNNNTRKSIRIGPGQVFYATGDIIGDIRKAY
TRPNNNTRKGIRIGPGRVIYATSAITGDIRQAH
TRPNNNTRKSIHLGPGQAFYATGDIIGOIRKAH
TRPNNNTRKSIHLGPGQAFYATDDIIGOIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGXIIGNIRKAY
TRPNNNTRKGIHIGVGRPFYRTVDIVGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRRAY
TRPNNNTRKSIXLGPGQAFYTTGNIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGDIIGNIRKAH
TRPNNNTRKSIHLGPGQAFYATGNIIGDIRKAH
TRPNNNTRKSIHIGPGQAXYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYTTGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGGIIGNIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHIGPGQAFYATGEVIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYATGDIIGDIRKAH
TRPNNNTRKSIHLGPGQAFYTTGEIIGDIRKAH
TRPNNNTRKSITIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSISFGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSIHIGPGQALYATGAIIGDIRQAH
TRPNNNTRKSIKFGTGRVLYATGAIIGNIRQAH
TRPNNNTRKSIRIGPGQAFYATGEIIGDIRQAH
TRPNNNTRKSITLGPGQAFYATGDIIGNIRQAH
TRPNNNTRKSITFAPGQAFYATGDIIGNIRQAH
IRPNNNTRKSIPIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSISIGPGQAFYATGDIIGDIRKAY
TRPNNNTRKSISIGPGQAFYATGDIIGDIRKAY
TRPNNNTRRSMRIGIGRGQTFHGAIIGDIRQAH
TRPNNNTRKSIRIGPGQAFYATGDIIGDIRQAH
TRPNNNTRKSINIGPGRAFYATGDIIGDIRQAY
TRPNNIRNIRTHIGSGQAIFTT-lO/IGDIRKAY

TRPNNNTRTSIHLGPGRAFYATGDIIGDIRQAH
TRPGNTTRRSMRIGPGRTFYTI GDI RKAH
TRPNNNTRKSVRIGPGQTFYATGDKKGDIRQAK
TRPNNNIRK3IRIGPGQAFFATGDIIGNIRQAQ
TRPNNNTRKSIRFGPGQAFYT-SDIIGDIRQAY
TRPNNNTRRSIHVGPGQAFYATGDIIGNIRKAH
TRPSNNTRRSIRFGPGQAFY-TNDIIGDIRQAY
TRPGSDKKIRIRIGPGKVFYAKGGITG QAH
ERPGIDIQE-IRIGPMA-WYSMGLGGTSSRAAY
ERPQIDIQE-MRIGPMA-WYSMGIGGTSSRAAY
IREIAEVQD-IYTGPMR-WRSMLKRSNPRSRVA
^RPGNQTIQKIMAGPMA-WYSM — NTKRA — AY
APPENDIX C



HIV output: 1/6 



0 | 0.00 | & $ A18|Q31|H33 $ & 36019 & 15684.208314 & 0.000000 & \cr
1 | 0.00 | & $ A18|T21 $ & 33816 & 12352.399254 & 0.000000 & \cr
2 | 0.01 | & $ A21|D24 $ & 45549 & 17706.407140 & 0.000000 & \cr
3 | 0.01 | & $ H12|A18 $ & 86025 & 24619.776947 & 0.000000 & \cr
4 | 0.01 | & $ H12|R17 $ & 48257 & 19028.783592 & 0.000000 & \cr
5 | 0.01 | & $ I11|R17 $ & 64548 & 27053.952336 & 0.000000 & \cr
6 | 0.02 | & $ L13|K31 $ & 39382 & 17335.347894 & 0.000000 & \cr
7 | 0.02 | & $ L13|W19|Q24 $ & 20184 & 379.160544 & 0.000000 & \cr
8 | 0.02 | & $ M13|W15 $ & 23300 & 6673.177086 & 0.000000 & \cr
9 | 0.02 | & $ N4|K9 $ & 162152 & 74737.922307 & 0.000000 & \cr
10 | 0.03 | & $ N4|K9|H33 $ & 26376 & 5666.716129 & 0.000000 & \cr
11 | 0.03 | & $ Q17|D24 $ & 86891 & 17162.233105 & 0.000000 & \cr
12 | 0.03 | & $ Q31|H33 $ & 23319C & 186078.818611 & 0.000000 & \cr
13 | 0.03 | & $ R12|Q17 $ & 53740 & 10564.956512 & 0.000000 & \cr
14 | 0.04 | & $ R12|T18 $ & 62774 & 18359.197022 & 0.000000 & \cr
15 | 0.04 | & $ R17|A18 $ & 54366 & 27136.429076 & 0.000000 & \cr
16 | 0.04 | & $ R17|E24 $ & 33748 & 10413.255892 & 0.000000 & \cr
17 | 0.04 | & $ R17|Q31 $ & 45065 & 26805.242087 & 0.000000 & \cr
18 | 0.05 | & $ R17|T21 $ & 70301 & 16232.354294 & 0.000000 & \cr
19 | 0.05 | & $ S10|D24 $ & 57772 & 17415.113746 & 0.000000 & \cr
20 | 0.05 | & $ V11|R12 $ & 39546 & 18975.126308 & 0.000000 & \cr
21 | 0.05 | & $ V11|R12|T18 $ & 17628 & 881.251263 & 0.000000 & \cr
22 | 0.06 | & $ K31|Y33 $ & 36346 & 20803.634880 & 0.000002 & \cr
23 | 0.06 | & $ N4|A21 $ & 45441 & 30227.409858 & 0.000003 & \cr
24 | 0.06 | & $ Q17|K31 $ & 25033 & 10875.740384 & 0.000018 & \cr
25 | 0.06 | & $ G10|H12 $ & 20779 & 7151.794446 & 0.000041 & \cr
26 | 0.07 | & $ K9|A21 $ & 40098 & 27695.038620 & 0.000231 & \cr
27 | 0.07 | & $ F19|D24 $ & 29121 & 16875.538795 & 0.000286 & \cr
28 | 0.07 | & $ Q17|A21 $ & 29621 & 18109.021417 & 0.000737 & \cr
29 | 0.07 | & $ H12|E24 $ & 22348 & 10939.327036 & 0.000839 & \cr
30 | 0.08 | & $ N4|K9|I11 $ & 15175 & 4159.316971 & 0.001355 & \cr
31 | 0.08 | & $ S4|T9|T12|V18|R21 $ & 10919 & 1.718549 & 0.001524 & \cr
32 | 0.08 | & $ N4|K9|A21 $ & 11233 & 623.181959 & 0.002185 & \cr
33 | 0.09 | & $ N4|Q31|H33 $ & 21868 & 11328.342993 & 0.002369 & \cr
34 | 0.09 | & $ F19|A21 $ & 44400 & 34516.144368 & 0.004910 & \cr
35 | 0.09 | & $ K9|Q31|H33 $ & 16593 & 6991.723713 & 0.006625 & \cr
36 | 0.09 | & $ W19|Q24 $ & 16738 & 7234.038664 & 0.007331 & \cr
37 | 0.10 | & $ E11N12 $ & 10814 & 1492.S35945 & 0.008575 & \cr
38 | 0.10 | & $ K9|E24 $ & 13847 & 4587.312260 & 0.009408 & \cr
39 | 0.10 | & $ K9|E17 $ & 33735 & 24568.179150 & 0.010326 & \cr
40 | 0.10 | & $ T12|V18 $ & 23076 & 14893.617567 & 0.026158 & \cr
41 | 0.11 | & $ R12|A21 $ & 15497 & 7516.155896 & 0.031231 & \cr
42 | 0.11 | & $ N4|K9|Q31|H33 $ & 8280 & 493.681367 & 0.036905 & \cr
43 | 0.11 | & $ N4|K9|A18 $ & 11655 & 4250.900600 & 0.050618 & \cr
44 | 0.11 | & $ S4|T9|T12|V18|R21|Y33 $ & 7370 & 0.093039 & 0.052029 & \cr
45 | 0.12 | & $ R12|Q17|T18 $ & 7452 & 240.364918 & 0.058992 & \cr
46 | 0.12 | & $ V11|Q17 $ & 14350 & 7329.962834 & 0.068429 & \cr
47 | 0.12 | & $ H12|T21 $ & 23263 & 16324.923094 & 0.072825 & \cr
48 | 0.12 | & $ Q17|Y33 $ & 17288 & 10374.788061 & 0.074203 & \cr
49 | 0.13 | & $ L13|W19 $ & 15536 & 8921.243955 & 0.092437 & \cr
50 | 0.13 | & $ S17|K28 $ & 6529 & 138.997153 & 0.108375 & \cr
51 | 0.13 | & $ N4|K9|Q31 $ & 10228 & 3854.612095 & 0.112708 & \cr
52 | 0.13 | & $ X8|S17 $ & 6573 & 275.512362 & 0.115524 & \cr
53 | 0.14 | & $ R17|Q31|H33 $ & 7265 & 1223.984346 & 0.137235 & \cr
54 | 0.14 | & $ T9|T12|V18|R21 $ & 6003 & 30.417827 & 0.143515 & \cr
55 | 0.14 | & $ N4|K9|A18|H33 $ & 6380 & 549.756091 & 0.157254 & \cr
56 | 0.14 | & $ S10|F19|D24 $ & 6150 & 620.344548 & 0.189437 & \cr
57 | 0.15 | & $ I11|R17|A18 $ & 6555 & 1027.737537 & 0.189642 & \cr
58 | 0.15 | & $ V11|R12|Q17 $ & 5751 & 247.598509 & 0.192378 & \cr
59 | 0.15 | & $ S4|T9|V18|R21 $ & 5514 & 35.313082 & 0.195240 & \cr
60 | 0.15 | & $ S4|T9|T12|V18|R21|K31 $ & 5462 & 0.090571 & 0.197200 & \cr
61 | 0.16 | & $ H12|R17|A18 $ & 5618 & 172.948903 & 0.199184 & \cr
62 | 0.16 | & $ Q9|T11|L19|-23 $ & 5464 & 38.188997 & 0.201464 & \cr
63 | 0.16 | & $ Y4|Q9|T11|-23 $ & 5364 & 35.276055 & 0.213243 & \cr
64 | 0.16 | & $ N4|A18|Q31|H33 $ & 6378 & 1180.344841 & 0.229871 & \cr
65 | 0.17 | & $ L3|N12|R23 $ & 5114 & 15.794611 & 0.243044 & \cr
66 | 0.17 | & $ V13|V15|I19 $ & 5095 & 4.314940 & 0.244059 & \cr
67 | 0.17 | & $ R12|Q17|D24 $ & 5088 & 122.811489 & 0.261410 & \cr
68 | 0.18 | & $ S4|T9|V18 $ & 5671 & 868.949180 & 0.285090 & \cr
69 | 0.18 | & $ G24|E28 $ & 5363 & 579.114112 & 0.297805 & \cr
70 | 0.18 | & $ S4|T9|R21 $ & 5425 & 650.238601 & 0.289174 & \cr
71 | 0.18 | & $ K9|I11|R17 $ & 5315 & 590.615207 & 0.296804 & \cr
72 | 0.19 | & $ V18|K31|Y33 $ & 5524 & 852.751002 & 0.304979 & \cr
73 | 0.19 | & $ T21|E24 $ & 19192 & 14557.811161 & 0.310756 & \cr
74 | 0.19 | & $ S4|T9|T12|V18|R21|K31|Y33 $ & 4390 & 0.004904 & 0.350351 & \cr
75 | 0.19 | & $ S4|T9|V18|R21|Y33 $ & 4341 & 1.910712 & 0.358927 & \cr
76 | 0.20 | & $ I11|H12|A18 $ & 5225 & 890.707158 & 0.359740 & \cr
77 | 0.20 | & $ H12|L13 $ & 9314 & 5009.363342 & 0.364791 & \cr
78 | 0.20 | & $ M1|S12|F20 $ & 4243 & 17.800459 & 0.378494 & \cr
79 | 0.20 | & $ Y4|T11|-23 $ & 4876 & 710.489341 & 0.388952 & \cr
80 | 0.21 | & $ H12|A18|H33 $ & 5292 & 1141.301814 & 0.391569 & \cr
81 | 0.21 | & $ N12|G24|L25 $ & 4169 & 18.987442 & 0.391690 & \cr
82 | 0.21 | & $ N12|T13 $ & 5365 & 1255.021021 & 0.398803 & \cr
83 | 0.21 | & $ N4|K9|G23 $ & 9804 & 5726.074196 & 0.404540 & \cr
84 | 0.22 | & $ P12|L13|W19|Q24 $ & 4070 & 20.998880 & 0.409748 & \cr
85 | 0.22 | & $ Q12|Y13|T15|G17|V26 $ & 4024 & 0.000255 & 0.414274 & \cr
86 | 0.22 | & $ S10|F19|A21 $ & 5598 & 1607.067572 & 0.420292 & \cr
87 | 0.22 | & $ K9|H12 $ & 26788 & 22912.753561 & 0.441631 & \cr
88 | 0.23 | & $ S10|Q17|D24 $ & 3960 & 93.803024 & 0.443318 & \cr
89 | 0.23 | & $ Q17|A21|D24 $ & 3949 & 133.098101 & 0.452738 & \cr
90 | 0.23 | & $ N4|K9|H12 $ & 4239 & 450.472945 & 0.457896 & \cr
91 | 0.23 | & $ T9|T12|V18|R21|Y33 $ & 3784 & 1.646276 & 0.459063 & \cr
92 | 0.24 | & $ Y4|Q9|T11 $ & 4402 & 639.401728 & 0.462612 & \cr
93 | 0.24 | & $ N4|K9|R17 $ & 4239 & 507.820002 & 0.468770 & \cr
94 | 0.24 | & $ N4|H12|A18 $ & 4450 & 726.677198 & 0.470266 & \cr
95 | 0.24 | & $ Q9|T11|L19 $ & 4413 & 691.708041 & 0.470653 & \cr
96 | 0.25 | & $ S4|T9|T12|R21 $ & 3747 & 31.482325 & 0.471755 & \cr
97 | 0.25 | & $ N12|S30 $ & 4440 & 766.347625 & 0.479764 & \cr
98 | 0.25 | & $ I1|S2 $ & 3970 & 345.480880 & 0.489218 & \cr
99 | 0.26 | & $ S4|T12|V18|R21 $ & 3643 & 32.472859 & 0.491921 & \cr
100 | 0.26 | & $ Q9|T11|-23 $ & 4299 & 742.828036 & 0.502461 & \cr
101 | 0.26 | & $ T21|Q31 $ & 16089 & 12621.469597 & 0.519777 & \cr
102 | 0.26 | & $ K9|A18|Q31|H33 $ & 4160 & 697.083962 & 0.520683 & \cr
103 | 0.27 | & $ S4|T9|R21|Y33 $ & 3460 & 35.030271 & 0.528142 & \cr
104 | 0.27 | & $ Y4|Q9|T11|L19|-23 $ & 3425 & 1.824291 & 0.529495 & \cr
105 | 0.27 | & $ S6|K7|T10|L11|M13|K16|G26|Y28 $ & 3409 & 0.000000 & 0.531288 & \cr
106 | 0.27 | & $ S4|T9|V18|R21|K31 $ & 3406 & 1.860057 & 0.532246 & \cr
107 | 0.28 | & $ S17|I19 $ & 4910 & 1510.151983 & 0.533993 & \cr
108 | 0.28 | & $ Y12|H20|R24 $ & 3401 & 29.556849 & 0.538702 & \cr
109 | 0.28 | & $ S4|T9|V18|R21|K31|Y33 $ & 3370 & 0.100690 & 0.539008 & \cr
110 | 0.28 | & $ S10|Q17 $ & 22065 & 18738.120311 & 0.547525 & \cr
111 | 0.29 | & $ A1|-22|S23 $ & 3303 & 7.355264 & 0.553724 & \cr
112 | 0.29 | & $ M13|W15|E31 $ & 3339 & 56.771417 & 0.556389 & \cr
113 | 0.29 | & $ *24|*25|*26|*27|*28|*29|*30|*31|*32|*33 $ & 3269 & 0.000000 & 0.559020 & \cr
114 | 0.29 | & $ R17|H33 $ & 31466 & 28229.156188 & 0.565421 & \cr
115 | 0.30 | & $ M13|W15|T13 $ & 3501 & 360.659791 & 0.584679 & \cr
116 | 0.30 | & $ F13|-22|S23 $ & 3123 & 6.681355 & 0.589400 & \cr
117 | 0.30 | & $ R17|A18|T21 $ & 3190 & 85.245042 & 0.592593 & \cr
118 | 0.30 | & $ N4|K9|A18|Q31|H33 $ & 3143 & 55.455693 & 0.594235 & \cr
119 | 0.31 | & $ R17|A18|Q31|H33 $ & 3144 & 101.027645 & 0.604153 & \cr
120 | 0.31 | & $ V1|N23|*24|*25|*26|*27|*28|*29|*30|*31|*32|*33 $ & 3030 & 0.000000 & 0.60G
121 | 0.31 | & $ A11|N12 $ & 4517 & 1452.835945 & 0.607916 & \cr
122 | 0.31 | & $ R12|T18|A21 $ & 3150 & 134.485298 & 0.609647 & \cr
123 | 0.32 | & $ S10|G23|D24 $ & 3606 & 599.551395 & 0.611461 & \cr
124 | 0.32 | & $ S1|M13|W15 $ & 3087 & 91.193028 & 0.613590 & \cr
125 | 0.32 | & $ N12|F20|K24 $ & 3202 & 213.735139 & 0.615099 & \cr
126 | 0.32 | & $ K13|W15|E24 $ & 3282 & 306.430052 & 0.61763? & \cr
127 | 0.33 | & $ K9|I11|F19|G23 $ & 4153 & 1180.595212 & 0.618272 & \cr
128 | 0.33 | & $ R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $ & 3353
129 | 0.33 | & $ H12|A18|Q31 $ & 3759 & 845.163446 & 0.629981 & \cr
130 | 0.34 | & $ K17|D20|-23 $ & 2928 & 25.438797 & 0.632234 & \cr
131 | 0.34 | & $ Y5|K7|R10|K23|N24|T28 $ & 2897 & 0.000008 & 0.633345 & \cr
132 | 0.34 | & $ G10|R17 $ & 9506 & 6637.164691 & 0.635967 & \cr
133 | 0.34 | & $ Y4|Q9|-23 $ & 3539 & 699.676594 & 0.644852 & \cr
134 | 0.35 | & $ G12|A22|D23|N24 $ & 2838 & 0.092735 & 0.645134 & \cr
135 | 0.35 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|I25|I26|G27|I29|R30|A32 $ & 3787
136 | 0.35 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $ & 3C
137 | 0.35 | & $ N4|H12 $ & 26775 & 23945.157075 & 0.646741 & \cr
138 | 0.36 | & $ V18|*24|*25|*26|*27|*28|*29|*30|*31|*32|*33 $ & 2775 & 0.000000 & 0.657651 & \cr
139 | 0.36 | & $ A1|R9|-22|S23 $ & 2763 & 0.413224 & 0.660115 & \cr
140 | 0.36 | & $ Y6|Y12|F13|H20|A22|K24 $ & 2761 & 0.000052 & 0.660430 & \cr
141 | 0.36 | & $ V11|A24|S28 $ & 2788 & 31.535138 & 0.661330 & \cr
142 | 0.37 | & $ V20|I22|-24|K25|M29 $ & 2748 & 0.000084 & 0.663009 & \cr
143 | 0.37 | & $ K9|H12|A18 $ & 3267 & 526.343072 & 0.664465 & \cr
144 | 0.37 | & $ T9|T12|V18|R21|K31 $ & 2742 & 1.602621 & 0.664517 & \cr
145 | 0.37 | & $ T8|R9 $ & 3185 & 445.441758 & 0.664683 & \cr
146 | 0.38 | & $ I11|H12|R17 $ & 2909 & 172.776969 & 0.665344 & \cr
147 | 0.38 | & $ Y6|X10|X12|M13|X18|X19|R31 $ & 2736 & 0.000000 & 0.665388 & \cr
148 | 0.38 | & $ A24|S28 $ & 3300 & 566.063943 & 0.665797 & \cr
149 | 0.38 | & $ G12|T18|A22|D23|N24 $ & 2692 & 0.005083 & 0.674094 & \cr
150 | 0.39 | & $ P12|W19|Q24 $ & 3054 & 395.340658 & 0.680669 & \cr
151 | 0.39 | & $ A14|H20|N24 $ & 2697 & 47.434702 & 0.682460 & \cr
152 | 0.39 | & $ T9|T12|V18|R21|K31|Y33 $ & 2632 & 0.086760 & 0.685931 & \cr
153 | 0.39 | & $ R12|Q17|A21 $ & 2701 & 79.045229 & 0.687867 & \cr
154 | 0.40 | & $ R17|A18|H33 $ & 3944 & 1325.820339 & 0.688628 & \cr
155 | 0.40 | & $ W15|I19|A24 $ & 2655 & 56.335384 & 0.692552 & \cr
156 | 0.40 | & $ Q12|R13|V20|I22|K24|-26|M29 $ & 2584 & 0.000000 & 0.695324 & \cr
157 | 0.40 | & $ S1|Y4|-6|N10|Y11|S12|S15|V21|K24 $ & 2554 & 0.000000 & 0.701181 & \cr
158 | 0.41 | & $ T18|A21 $ & 6883 & 4332.151205 & 0.701796 & \cr
159 | 0.41 | & $ K17|D20 $ & 2996 & 458.835571 & 0.704460 & \cr
160 | 0.41 | & $ Q17|D24|K31 $ & 2660 & 125.180912 & 0.704916 & \cr


161 | 0.41 | & $ L13|Q15|W19 $ & 2582 & 98.222466 & 0.714812 & \cr
162 | 0.42 | & $ S4|T9|R21|K31|Y33 $ & 2474 & 1.844223 & 0.717056 & \cr
163 | 0.42 | & $ I1|G4|M11|P18|R22|-24|V25 $ & 2445 & 0.000002 & 0.722286 & \cr
164 | 0.42 | & $ S12|F13 $ & 4939 & 2502.178252 & 0.723857 & \cr
165 | 0.43 | & $ L13|Q17|K31 $ & 2663 & 227.467572 & 0.724104 & \cr
166 | 0.43 | & $ K9|R17|H33 $ & 3142 & 710.504406 & 0.724879 & \cr
167 | 0.43 | & $ P12|L13|W19 $ & 2907 & 483.231131 & 0.726360 & \cr
168 | 0.43 | & $ K9|R17|A18 $ & 3012 & 598.308696 & 0.728290 & \cr
169 | 0.44 | & $ S4|T12|R21 $ & 3010 & 597.264141 & 0.728473 & \cr
170 | 0.44 | & $ N4|I11|R17 $ & 3233 & 820.559839 & 0.728529 & \cr
171 | 0.44 | & $ M13|A24|E31 $ & 2426 & 50.435106 & 0.735563 & \cr
172 | 0.44 | & $ L3|A12|T18|V19|D23|R24 $ & 2374 & 0.000104 & 0.735861 & \cr
173 | 0.45 | & $ K9|A21|H33 $ & 3269 & 897.012220 & 0.736243 & \cr
174 | 0.45 | & $ R2|P3|N5|N6|T7|R8|G14|P15|G16|F19|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $ &
175 | 0.45 | & $ R10|X11|S12|V25 $ & 2345 & 0.448221 & 0.741446 & \cr
176 | 0.45 | & $ N4|K9|I11|G23 $ & 2883 & 541.944923 & 0.742108 & \cr
177 | 0.46 | & $ R1*;|A13|Q31 $ & 3304 & 973.769538 & 0.744153 & \cr
178 | 0.46 | & $ Y4|Q9|T11|F13|Y19|-23 $ & 2321 & 0.009829 & 0.74589b & \cr
179 | 0.46 | & $ I7|F20|Q33 $ & 2355 & 3e.678004 & 0.746775 & \cr
180 | 0.46 | & $ T9|V18|K31|Y13 $ & 2352 & 43.522103 & 0.748251 & \cr
181 | 0.47 | & $ L3|A12|V19|D23|R24 $ & 2307 & 0.001890 & 0.748529 & \cr
182 | 0.47 | & $ G4|M11|P1S $ & 2306 & 12.419975 & 0.751048 & \cr
183 | 0.47 | & $ S4|T12|V18|R21|Y33 $ & 2292 & 1.757250 & 0.751673 & \cr
184 | 0.47 | & $ H12|R17|T21 $ & 2417 & 129.651999 & 0.752215 & \cr
185 | 0.48 | & $ R10|S12|W19|Q24 $ & 2299 & 14.238983 & 0.752700 & \cr
186 | 0.48 | & $ D4|E6|I7|L11|C12|V13|V20|A22|T24|-25|A26|T29|Q33 $ & 2279 & 0.000000 &
187 | 0.48 | & $ G10I724 $ & 2727 & 449.008967 & 0.753966 & \cr
188 | 0.48 | & $ V19|R24|V26|L31 $ & 2272 & 0.404088 & 0.755161 & \cr
189 | 0.49 | & $ V11|R12|Q17|T18 $ & 2281 & 11.909386 & 0.755629 & \cr
190 | 0.49 | & $ U3|W15|Q24|E28 $ & 2270 & 0.994080 & 0.755644 & \cr
191 | 0.49 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|F19|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $
192 | 0.49 | & $ R17|T21|E24 $ & 2366 & 123.762808 & 0.760627 & \cr
193 | 0.50 | & $ M13|W15|N28 $ & 2610 & 372.687253 & 0.761541 & \cr
194 | 0.50 | & $ M13|K17|V26 $ & 2455 & 218.504333 & 0.761692 & \cr
195 | 0.50 | & $ M13|Q15|G24 $ & 2335 & 100.386181 & 0.761856 & \cr
196 | 0.51 | & $ F19|G23|D24 $ & 3105 & 885.799386 & 0.764893 & \cr
197 | 0.51 | & $ M11|I15|G18|Q19|T20|F21|H22|A24 $ & 2218 & 0.000000 & 0.765115 & \cr
198 | 0.51 | & $ R9|A14|H20|N24 $ & 2214 & 2.664734 & 0.766345 & \cr
199 | 0.51 | & $ S4|T9|V18|K31 $ & 2253 & 45.367685 & 0.767028 & \cr
200 | 0.52 | & $ Y4|T11|L19|-23 $ & 2222 & 36.549243 & 0.771106 & \cr
201 | 0.52 | & $ K9|A18|H33 $ & 9877 & 7701.775158 & 0.772980 & \cr
202 | 0.52 | & $ T9|S18|H20 $ & 2246 & 73.652930 & 0.773506 & \cr
203 | 0.52 | & $ G10|S17|I19 $ & 2252 & 85.340274 & 0.774546 & \cr
204 | 0.53 | & $ T12|F13|A14 $ & 2217 & 52.732117 & 0.774983 & \cr
205 | 0.53 | & $ N4|A21|H33 $ & 3446 & 1316.842768 & 0.781367 & \cr
206 | 0.53 | & $ N9|R10|S12|H20|K23|O24 $ & 2120 & 0.000164 & 0.783023 & \cr
207 | 0.53 | & $ R12|T18|D24 $ & 2373 & 258.082710 & 0.783941 & \cr
208 | 0.54 | & $ T21|H33 $ & 13722 & 11609.274530 & 0.784336 & \cr
209 | 0.54 | & $ T9|V18|R21 $ & 2733 & 627.501151 & 0.785638 & \cr
210 | 0.54 | & $ L13|K17|V19 $ & 2223 & 123.584647 & 0.786733 & \cr
211 | 0.54 | & $ T9|V18|Y33 $ & 2928 & 837.332414 & 0.788304 & \cr
212 | 0.55 | & $ E1|Q4|I5|D6|I7|Q8|E9|-10|M11|M16|A17|-18|W19|S21|M22|I24|G25|G26|T27|S28|S2
213 | 0.55 | & $ Y5|K7|K23|N24|T28 $ & 2075 & 0.000148 & 0.791109 & \cr
214 | 0.55 | & $ S4|W15|I19|A24 $ & 2071 & 2.211998 & 0.792396 & \cr
215 | 0.55 | & $ N4|K9|A18|Q31 $ & 2414 & 350.433693 & 0.793149 & \cr
216 | 0.56 | & $ X12|L13|N24 $ & 2091 & 32.654961 & 0.794078 & \cr
217 | 0.56 | & $ S8|R10|S12|R20|-23|K24 $ & 2056 & 0.001995 & 0.794496 & \cr
218 | 0.56 | & $ S10|A21|D24 $ & 2195 & 141.715389 & 0.794978 & \cr
219 | 0.56 | & $ T5|K6|K7|I8|H10|G24|M26 $ & 2049 & 0.000000 & 0.795739 & \cr
220 | 0.57 | & $ T9|V18|K31 $ & 2861 & 816.350692 & 0.796510 & \cr
221 | 0.57 | & $ I20|A22|T23|K24 $ & 2040 & 0.135518 & 0.797358 & \cr
222 | 0.57 | & $ Y3|A4|-21|N23 $ & 2039 & 0.001752 & 0.797511 & \cr
223 | 0.57 | & $ G4|W11|D23|G24 $ & 2039 & 0.335758 & 0.797570 & \cr
224 | 0.58 | & $ I11|E24 $ & 4624 & 2601.997572 & 0.798748 & \cr
225 | 0.58 | & $ G10|G17|G24 $ & 2157 & 138.303116 & 0.801095 & \cr
226 | 0.58 | & $ Y6|S8|R10|A15|R16|K22|K24 $ & 2011 & 0.000000 & 0.802448 & \cr
227 | 0.59 | & $ S4|T9|R21|K31 $ & 2043 & 34.105157 & 0.802818 & \cr

228 | 0.59 | & $ D4|E6|I7|R9|L11|Q12|V13|V20|A22|T24|-25|A26|T29|Q33 $ & 1999 & 0.000000 &
229 | 0.59 | & $ Q9|T11|L19|-23|K24 $ & 1990 & 1.569195 & 0.806400 & \cr
230 | 0.59 | & $ S8|P12|X24 $ & 2000 & 16.301963 & 0.807225 & \cr
231 | 0.60 | & $ S4|T9|T12|R21|Y33 $ & 1985 & 1.703737 & 0.807295 & \cr
232 | 0.60 | & $ R10|Y12|V19|024|R31 $ & 1982 & 0.043977 & 0.807529 & \cr
233 | 0.60 | & $ T4|09|F20|K23|G24 $ & 1979 & 0.004973 & 0.808044 & \cr
234 | 0.60 | & $ L11|S12|L13|V26 $ & 1972 & 4.533048 & 0.810047 & \cr
235 | 0.61 | & $ T5|K6|K7|I8|K9|H10|G24|M26 $ & 1967 & 0.000000 & 0.810128 & \cr
236 | 0.61 | & $ S6|K7|T10|L11|K16|G26|Y28 $ & 1956 & 0.000000 & 0.812033 & \cr
237 | 0.61 | & $ T9|V13|R21|Y33 $ & 1953 & 33.839576 & 0.813214 & \cr
238 | 0.61 | & $ R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|G23|I25|I26|G27|D28|I29|R30|A32 $ &
239 | 0.62 | & $ F19|A21|D24 $ & 2034 & 139.045764 & 0.813940 & \cr
240 | 0.62 | & $ L1|M13|W15 $ & 1949 & 9.905173 & 0.814948 & \cr
241 | 0.62 | & $ Q9|T11|Q12|L19|F20|K22|T23|R24 $ & 1933 & 0.000000 & 0.815996 & \cr
242 | 0.62 | & $ F19|A21|G23 $ & 4336 & 2404.^25279 & 0.816257 & \cr
243 | 0.63 | & $ H12|R17|E24 $ & 2006 & 91.386294 & 0.819143 & \cr
244 | 0.63 | & $ L13|W15|V26 $ & 2149 & 237.129335 & 0.819611 & \cr
245 | 0.63 | & $ N12|W19|-23|N24 $ & 1909 & 7.808753 & 0.821430 & \cr
246 | 0.63 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|I25|G27|I29|R30|A32 $ & 4991 &
247 | 0.64 | & $ T21|Q24 $ & 77*73 & 5882.829451 & 0.823300 & \cr
248 | 0.64 | & $ G4|V11|R12 $ & 2497 & 603.660368 & 0.823510 & \cr
249 | 0.64 | & $ Q17|K31|Y33 $ & 2149 & 263.756833 & 0.824134 & \cr
250 | 0.64 | & $ M25|K26 $ & 2096 & 217.906699 & 0.825341 & \cr
251 | 0.65 | & $ T21|Q31|H33 $ & 2236 & 361.582066 & 0.825961 & \cr
252 | 0.65 | & $ T12|V18|R21 $ & 2446 & 576.530816 & 0.826794 & \cr
253 | 0.65 | & $ Y4|Q9|T11|L19|-23|N24 $ & 1869 & 0.047981 & 0.826881 & \cr
254 | 0.65 | & $ H12|Q?1|H32 $ & 2522 & 1055.347497 & 0.827256 & \cr
255 | 0.66 | & $ R9|M11|I15|G18|Q19|T20|F21|K22|A24 $ & 1365 & 0.000000 & 0.827546 & \cr
256 | 0.66 | & $ V1|T18|H23|*24|*25|*26|*27|*28|*29|*30|*31|*32|*33 $ & 1659 & 0.000000
257 | 0.66 | & $ G4|L13|W15|A24 $ & 1866 & 7.095250 & 0.828568 & \cr
258 | 0.66 | & $ P12|T21 $ & 11256 & 9405.119546 & 0.829912 & \cr
259 | 0.67 | & $ T9|K31|Y33 $ & 2569 & 323.122139 & 0.830747 & \cr
260 | 0.67 | & $ K7|A14|A24|V32 $ & 1844 & 0.130705 & 0.831083 & \cr
261 | 0.67 | & $ Q9|T11|I19|-22|T23|K24|V25|V26 $ & 1841 & 0.000000 & 0.831561 & \cr
262 | 0.68 | & $ R9|K17|G24 $ & 2157 & 318.307130 & 0.831945 & \cr
263 | 0.68 | & $ A14|G24 $ & 5018 & 3183.960671 & 0.832719 & \cr
264 | 0.68 | & $ S8|R10|S12|V20|A22|R23 $ & 1834 & 0.000248 & 0.832726 & \cr
265 | 0.68 | & $ L11|S12|V26 $ & 1919 & 85.184287 & 0.832757 & \cr
266 | 0.69 | & $ R2|P3|N5|N6|T7|R8|S10|G14|P15|G16|F19|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $
267 | 0.69 | & $ R12|T18|R31 $ & 2336 & 510.426023 & 0.834125 & \cr
268 | 0.69 | & $ V11|R12|D24 $ & 2089 & 265.729908 & 0.834506 & \cr
269 | 0.69 | & $ H12|A18|T21 $ & 1899 & 83.244349 & 0.835749 & \cr
270 | 0.70 | & $ R12|D24 $ & 12816 & 11016.724531 & 0.838463 & \cr
271 | 0.70 | & $ D5|Q24|N28 $ & 1856 & 57.570055 & 0.838602 & \cr
272 | 0.70 | & $ V11|A21 $ & 6489 & 4695.336767 & 0.839384 & \cr
273 | 0.70 | & $ T9|T12|R21 $ & 2344 & 558.736361 & 0.T0758 & \cr
274 | 0.71 | & $ G10|L13|W19|Q24 $ & 1805 & 21.432425 & 0.841035 & \cr
275 | 0.71 | & $ S4|X10 $ & 2670 & 889.423177 & 0.841523 & \cr
276 | 0.71 | & $ Q9|L19|-23|N24|V25 $ & 17ei & 0.559631 & 0.841545 & \cr
277 | 0.71 | & $ N12|W19|N24 $ & 1920 & 140.804894 & 0.841748 & \cr
278 | 0.72 | & $ H4|R9|T12 $ & 1843 & 63.837063 & 0.841754 & \cr
279 | 0.72 | & $ R10|S12|S19|Q24 $ & 1775 & 2.690479 & 0.842869 & \cr
280 | 0.72 | & $ M13|K17|T18 $ & 2153 & 386.143038 & 0.843755 & \cr
281 | 0.72 | & $ R17|T21|Q31 $ & 1850 & 91.352915 & 0.845085 & \cr
282 | 0.73 | & $ S8|X24 $ & 2047 & 293.960536 & 0.845991 & \cr
283 | 0.73 | & $ Y4|Q9|R10|T11|L19|-23|R24 $ & 1749 & 0.003580 & 0.846644 & \cr
284 | 0.73 | & $ I1|E3|I4|A5|E6|V7|Q8|D9|-10|Y12|T12|M16|-18|W19|R20|S21|M22|L23|K24|R25|S26
285 | 0.73 | & $ D5|Q24 $ & 2759 & 1018.937453 & 0.848081 & \cr
286 | 0.74 | & $ T1|R2|P3|N5|N6|T7|R8|G14|P15|G16|Y20|T22|I25|I26|G27|D28|I29|R30|A32 $ & 1
287 | 0.74 | & $ K9|H12|R17 $ & 1894 & 166.406797 & 0.850079 & \cr
288 | 0.74 | & $ V11|Q17|D24 $ & 1812 & 93.103584 & 0.851467 & \cr
289 | 0.74 | & $ A1|M11 $ & 2729 & 1013.253017 & 0.851968 & \cr
290 | 0.75 | & $ H12|A18|Q31|H33 $ & 1795 & 85.511085 & 0.852963 & \cr
291 | 0.75 | & $ L19|S22|V26 $ & 1757 & 49.527891 & 0.853283 & \cr
292 | 0.75 | & $ R9|N12|M13 $ & 2146 & 444.904483 & 0.854292 & \cr
293 | 0.76 | & $ Q6|M13|W15|H20|N22 $ & 1695 & 0.0G5297 & 0.855256 & \cr
294 | 0.76 | & $ Y4|T11|L19 $ & 2355 & 661.700054 & 0.855524 & \cr
295 | 0.76 | & $ I19|-23|V25 $ & 1827 & 134.704350 & 0.855682 & \cr


296 | 


0.76 | 


4 


$ T9|V18|R21|K31|Y33 $ 4 1692 4 1.781938 4 0.856009 4 Vcr 


297 j 


0.77 j 


4 


S S151-21 1-22 |-24 S & 1664 4 0.060460 4 0.860125 4 Vcr 


298 | 


0.77 j 


4 


$ X12|N24 $ 4 2272 4 614.028472 4 0.861054 4 Vcr 


299 j 


0.77 | 


4 


$ A1|M11|T18 $ 4 1713 4 55.464572 & 0.861122 4 Vcr 


300 j 


0.77 | 


4 


S V20|I22|-24|K25(N28|H29 $ 4 1657 4 0.000005 4 0.861205 4 Vcr 


301 | 


0.78 | 


4 


S M9|N28 $ 4 2C94 4 448.507779 4 C. 863034 4 Vcr 


302 | 


0.78 | 


4 


S A21|Q31|H33 S 4 3220 4 1574.859557 4 0.863042 4 Vcr 


303 i 


0.78 | 


4 


S Q12|R13|V20|X22lK21|-26|N28|M2* S 4 1645 4 0.000000 4 0.863064 4 vcr 
S L3jT9 $ 4 3676 4 20:i.224C46 4 0 . 86J083 4 Vcr 


304 | 


0.78 \ 


4 


305 i 


0.79 | 


& 


S D24IK31 S 4 12967 4 11324.565776 4 0.863460 4 \cr 


306 j 


0 79 ! 


4 


3 L13|K31|Y33 S 4 2465 4 827.R71734 4 0.864276 4 Vcr 


307 j 


0. 79 j 


4 


S L19|T2l|-23 S 4 1949 4 312.399933 4 0.864436 & Vcr 


3C8 | 


0. 79 | 


Sl 


S G4|E9|R10|S12|I20|R22|Q24 3 4 1633 4 0.000021 4 0.S64913 & \cr 


309 | 


0.80 | 


ft 


$ G4}W1S|A24 $ 4 1765 4 1J3.J49343 4 0.865121 4 Vcr 


310 j 


o.eo j 


4 


$ Tl | R2 | P3 | N5 | N6 |T7 | RS |G14 1 P15 { G16 | Y20 |T22 |G23 | 125 | 126 iG27iD28j 12^1 R30|A32 S 


311 | 


0.80 i 


4 


$ N4|R17|A18 $ 4 2464 4 833.718080 4 0.865331 4 Vcr 


312 I 


O.dO j 


Sl 


$ Y4|T5|K6|H9 j -10 l-llj-12 i- 13| Rl 4 | A15 j C16 |G17 ( R18 | A19 | V/20 | W21 |T23|024 1T26 i 


313 | 


0.81 ! 


Sl 


S M13jSl7|I19 $ 4 1697 4 69. 061642 4 0.865691 4 Vcr 


314 


0.81 


4 


$ I13!R17|031 S 4 285' 4 1234.2C8538 4 0.866479 4 Vcr 


315 


0 .81 


4 


S SltVil|V25 5 4 1725 4 114.913068 4 0.868418 4 Vcr 


316 


0 .31 


Sl 


$ Q9|L19|-23 $ 4 236' 7 f. 7SP.3758SC 4 C. 868641 4 Vcr 


317 


0 . 62 


4 


$ A12|V19|R24 $ 1625 4 17.183947 4 0.868764 4 Vcr 


31S 


1 0 . 82 


& 


S X8 j P9 |X13 |X31 S 4 1606 4 0.000107 4 C. 869040 4 Vcr 


319 


| 0.82 


4 


$ P12 |A14 |S17jW19jF20 $ 4 16C5 4 C. 078275 4 0. 869203 4 Vcr 
S El |KS|T12' ; D20iY22 (G24 $ 4 1602 4 0.000007 4 0.86S647 4 Vcr 


320 


t 0.82 


I & 


321 


1 0.83 


1 & 


$ TlllL19|-23 S 4 2366 4 770.1363G5 4 0.870576 4 Vcr 


322 


j 0.3 3 


i & 


$ Il|R9|ri2 5 4 1615 L 19.398937 4 0.870616 4 Vcr 


323 


j 0.83 


i & 


S H3(R12 3 4 2021 4 425.578936 4 0.870643 4 Vcr 


324 


| 0.84 


i & 


S I1|I12 S 4 1939 h 345.480880 4 0.870930 4 Vcr 


325 


i 0 .84 


i & 


S R9|S15| -21|-22',-23l-24 S 4 1592 4 0.000188 4 G. 871160 4 Vcr 


326 


( 0.84 


I £ 


S A12|T18|V19|R24 $ 4 1583 4 0.942061 4 0.872S57 4 Vcr 


327 


I o .e4 


t 4 S W15|Q24|E2B S 4 1596 4 18.677&70 4 0,873368 4 \cr 


328 


| 0.85 


| 4 S Y12 (V19 iQ24|R31 $ & 1578 4 0.826255 4 0.873390 4 \cr 


329 


( 0.85 


1 4 $ EX|G4|l5iD6|l7|Q8|E9|-10|Ml6|A17l-:9lW19|S2i|M22|L24|G2S|G26|T27lS28|S29iA3 
HIV output: 6/6

330 | 0.85 | & $ G8|L13|A14|G18|H20 $ & 1575 & 0.000857 & 0.873716 & \cr
331 | 0.85 | & $ M1X|R12|T18 $ & 1784 & 214.819937 & 0.874586 & \cr
332 | 0.86 | & $ Y4|I8|Q9|T11|Y19|I23|S24|V25 $ & 1569 & 0.000000 & 0.874613 & \cr
333 | 0.86 | & $ S4|K7|T9|A14|A24|V32 $ & 1568 & 0.000391 & 0.874763 & \cr
334 | 0.86 | & $ G10|V19|R24|V26|L31 $ & 1567 & 0.022878 & 0.874915 & \cr
335 | 0.86 | & $ A18|T21|H33 $ & 1950 & 386.091755 & 0.875373 & \cr
336 | 0.87 | & $ T9|V18|R21|K31 $ & 1595 & 32.945102 & 0.875649 & \cr
337 | 0.87 | & $ N12|N28|E31 $ & 1644 & 84.308230 & 0.876001 & \cr
338 | 0.87 | & $ N4|K9|F19|G23 $ & 2413 & 853.445657 & 0.876021 & \cr
339 | 0.87 | & $ S15|-21|-22|-23|-24 $ & 1550 & 0.003354 & 0.877439 & \cr
340 | 0.88 | & $ V15|P18|R21|V23|V26 $ & 1543 & 0.000396 & 0.878473 & \cr
341 | 0.88 | & $ V11|R12|A21 $ & 1677 & 139.0S6915 & 0.879218 & \cr
342 | 0.88 | & $ S4|K31|Y33 $ & 2429 & 891.874986 & 0.879339 & \cr
343 | 0.88 | & $ Y4|Q9|T11|L19 $ & 1566 & 32.897928 & 0.879930 & \cr
344 | 0.89 | & $ A14|S17|W19|F20 $ & 1534 & 1.410979 & 0.88000S & \cr
345 | 0.89 | & $ G4|R9|F20|T26 $ & 1540 & 12.834655 & 0.880800 & \cr
346 | 0.89 | & $ Y6|X10|X12|X18|X19|R31 $ & 1525 & 0.000000 & 0.881117 & \cr
347 | 0.89 | & $ L1|S4|M13|W15 $ & 1525 & 0.559824 & 0.881199 & \cr
348 | 0.90 | & $ N4|K9|A21|H33 $ & 1568 & 48.227041 & 0.881881 & \cr
349 | 0.90 | & $ T9|V11|R22 $ & 1769 & 253.858213 & 0.882556 & \cr
350 | 0.90 | & $ Y6|G8|R10|L11|S12|V20|R23|K24 $ & 1515 & 0.000000 & 0.882576 & \cr
351 | 0.90 | & $ X4|K31 $ & 1926 & 418.267274 & 0.883632 & \cr
352 | 0.91 | & $ P12|D23|N24 $ & 1623 & 115.955006 & 0.883732 & \cr
353 | 0.91 | & $ Q9|K23 $ & 4487 & 2986.952760 & 0.884744 & \cr
354 | 0.91 | & $ G4|R9|M13|W15 $ & 1511 & 14.154263 & 0.885206 & \cr
355 | 0.91 | & $ R2|P3|N5|N6|T7|R8|I13|G14|P15|G16|F19|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $
356 | 0.92 | & $ V13|W15|L19 $ & 1573 & 83.913345 & 0.886323 & \cr
357 | 0.92 | & $ P12|S30 $ & 2786 & 1298.604637 & 0.886566 & \cr
358 | 0.92 | & $ V1|R12|T18|N23|*24|*25|*26|*27|*28|*29|*30|*31|*32|*33 $ & 1487 & 0.000000
359 | 0.93 | & $ Q17|D24|Y33 $ & 1608 & 121.315232 & 0.886668 & \cr
360 | 0.93 | & $ E9|R12|T18 $ & 1614 & 133.600703 & 0.887568 & \cr
361 | 0.93 | & $ G4|R12|T18 $ & 2078 & 597.832556 & 0.887601 & \cr
362 | 0.93 | & $ H4|P12 $ & 2777 & 1298.604637 & 0.887855 & \cr
363 | 0.94 | & $ T1|R2|P3|N5|N6|T7|R8|G14|G16|Y20|T22|G23|I25|I26|G27|I29|R30|A32 $ & 1925 &
364 | 0.94 | & $ S4|T9|N12|V13|R21 $ & 1474 & 1.1S8008 & 0.888647 & \cr
365 | 0.94 | & $ W19|X24|T26 $ & 1724 & 252.469231 & 0.888834 & \cr
366 | 0.94 | & $ A1|E9 $ & 2089 & 630.489942 & 0.890631 & \cr
367 | 0.95 | & $ A1|G4|F20|A24 $ & 1455 & 2.044033 & 0.891465 & \cr
368 | 0.95 | & $ T9|C12|-22|G24 $ & 1450 & 0.214607 & 0.891912 & \cr
369 | 0.95 | & $ Y4|Q9|T11|Y19|W20|-23|N24 $ & 1447 & 0.000061 & 0.892304 & \cr
370 | 0.95 | & $ G10|M13|A24|E31 $ & 1446 & 2.855158 & C8S28«S & \cr
371 | 0.96 | & $ S10|D24|I26 $ & 2229 & 789.361165 & 0.893336 & \cr
372 | 0.96 | & $ G4|M13|W15 $ & 1691 & 252.133281 & 0.893444 & \cr
373 | 0.96 | & $ N12|E31 $ & 2929 & 1492.835945 & 0.893322 & \cr
374 | 0.96 | & $ T12|F12|A14|N28 $ & 1436 & 2.983773 & 0.894262 & \cr
375 | 0.97 | & $ S4|T12|V18|R21|K31 $ & 1434 & 1.7J065S & 0.894363 & \cr
376 | 0.97 | & $ S8|G10|X24 $ & 1444 & 16.637502 & 0.895049 & \cr
377 | 0.97 | & $ Q9|T11|L19|-23|K24|R31 $ & 1427 & 0.051495 & 0.8951G6 & \cr
378 | 0.97 | & $ M13|S17|N*28 $ & 1661 & 239.34^729 & 0.895842 & \cr
379 | 0.98 | & $ R10|Y12|V19|E23|Q24 $ & 1420 & 0.021913 & 0.8960^4 & \cr
380 | 0.98 | & $ R12|I13 $ & 2328 & 90S.283714 & 0.896110 & \cr
381 | 0.98 | & $ G10|I20|A22|T23|K24 $ & 1415 & 0.007673 & 0.896763 & \cr
382 | 0.98 | & $ Q9|V18|K23 $ & 1572 & 162.149493 & 0.897472 & \cr
383 | 0.99 | & $ R17|T21|H33 $ & 1486 & 82.159457 & 0.898299 & \cr
384 | 0.99 | & $ T9|S18|K30 $ & 1425 & 25.486414 & 0.898S92 & \cr
385 | 0.99 | & $ G8|A14|G18|H20 $ & 1399 & 0.016110 & 0.8989S4 & \cr
386 | 0.99 | & $ T12|S15|H20|I22|E23|K24 $ & 1393 & 0.000820 & 0.09978:. & \cr
387 | 1.00 | & $ I1|Y4|-22|S23|R24 $ & 1393 & 0.040715 & 0.899786 & \cr
APPENDIX D

File probsort.pl: 1/1

# Read every line of the input file named on the command line.
$fm = $ARGV[0];

open (IN, $fm);
@prob = <IN>;
chop @prob;
close (IN);

# Keep only the table rows (lines containing "cr").
@prob = grep (/cr/, @prob);

open (TEMP, "> probsort.temp");

foreach (@prob)
{
    print TEMP "$_\n";
}

close (TEMP);

# exit;

$fm = $fm . ".prob";

# print "fm: $fm\n";

# Sort the rows numerically with the system sort, then remove the scratch file.
`sort -o prob.tmp -n01234567890 . +9 probsort.temp`;

`rm probsort.temp`;

open (IN, "prob.tmp");
@m2 = <IN>;
chop @m2;
close (IN);

`rm prob.tmp`;

open (TEMP, "> $fm");

# Prefix each sorted row with its rank and its rank as a fraction of the row count.
$total = scalar @m2;
$i = 0;
foreach (@m2)
{
    printf TEMP "%3d | %.2f | %s\n", $i, ($i / $total), $_;
    $i++;
}
APPENDIX E



A[GAP43 RATGAP43] 
D[nAChRd RNZCRD1]
A[PTN RATHBGAM]
D[Ins1 RNINS1]
B[cjun RNRJG9]
A[CCO1 RATMTCYTOC]
A[DD63.2 (I I)] 

A[GAP43 RATGAP43] 
B[nAChRa4 RATNARAA] 
A[CCO2 RATMTCYTOC]

A[GAP43 RATGAP43] 
C[mGluR2 RATMGLURB] 
B[FGFR RATFGFR1]
B[SOD RNSODR] 

B[NMDA2D RNU08260] 
B[EGF RATEPGF] 

B[G67I80/86 RATGAD67] 

A[MAP2 RATMAP2] 
A[synaptophysin RNSYN] 
B[ChAT (*)]
A[GRa2 (|)] 
A[GRb3 RATGARB3] 
C[mGluR8 MMU17252]
D[nAChRa6 RATNARA6S] 
A[trkB RATTRKB1] 
A[PTN RATHBGAM] 
B[IGF II RATGFI2] 
A[H2AZ RATHIS2AZ] 
A[TCP (I I)] 

A[CCO2 RATMTCYTOC]
A[DD63.2 (I I)] 

D[GFAP RNU03700] 
A[NT3 RATHDNFNT] 
C[PDGFb RNPDGFBCP] 
C[cfos RNCFOSR] 

F[cellubrevin s63830]
B[InsR RATINSAB]

A[GAD67 RATGAD67] 
B[5HT2 RATSR5HT2] 

F[cellubrevin s63830] 
B[InsR RATINSAB] 

A[nestin RATNESTIN] 
B[CNTF RNCNTF] 



A[ODC RATODC] 
D[nAChRe RNACRE]
B[FGFR RATFGFR1]
A[cyclin A RATPCNA] 
A[TCP (I I)] 

A[CCO2 RATMTCYTOC]



A[GAD65 RATGAD65] 
B[FGFR RATFGFR1]



F[NFM RATNFM]
B[nAChRa4 RATNARAA] 
A[IGF I RATIGFIA] 
A[CCO2 RATMTCYTOC]

D[nAChRe RNACRE] 
B[TGFR RATTGFBIIR] 

A[SOD RNSODR] 

A[GAP43 RATGAP43] 
A[neno RATENONS] 
A[ODC RATODC]
A[GRa3 RNGABAA] 
A[GRg3 RATGABAA] 
B[NMDA2B RATNMDA2B] 
D[nAChRd RNZCRD1]
A[CNTFR S54212] 
B[FGFR RATFGFR1]
A[IP3R2 RNITPR2R] 
B[cjun RNRJG9] 
F[actin RNAC01] 
A[SC1 RNU19135] 



D[GRb2 RATGARB2] 
B[CNTF RNCNTF]
B[PDGFR RNPDGFRBE] 



D[G67I86 RATGAD67] 



C[mGluR6 RATMGLUR6.]
B[Ins2 RNINS2] 

D[mGluR6 RATMGLUR6.] 
A[SC2 RNU19136]

B[TH RATTOHA] 
B[EGF RATEPGF]



B[nAChRa4 RATNARAA] 0.00000 

A[CNTFR S54212]

B[TGFR RATTGFBIIR]

A[H2AZ RATHIS2AZ]

F[actin RNAC01]

A[SC1 RNU19135] 



A[GRg2 (#)] 
B[cjun RNRJG9] 



0.00000 



A[G67I80/86 RATGAD67] 0.00000 
B[nAChRa5 RATNACHRR] 
B[cjun RNRJG9]



C[mAChR4 RATACHRMD] 0.00000



B[SC7 RNU19141] 0.00000 

B[L1 S55536] 0.00000 
F[GAT1 RATGABAT] 
B[NOS RRBNOS] 
A[GRa5 (#)] 

B[mGluR3 RATMGLURC] 
B[nAChRa4 RATNARAA] 
B[5HT1b RAT5HT1BR]
A[MK2 MUSMK]
D[Ins1 RNINS1]
A[cyclin A RATPCNA]
B[Brm (I I)]

A[CCO1 RATMTCYTOC]
D[SC6 RNU19140]



D[NMDA2C RATNMDA2C] 0.00000
D[bFGF RNFGFT] 
A[cyclin B RATCYCLNB] 



B[IGF I RATIGFIA] 



0.00000 



C[mAChR3 RATACHRMB] 0.00000 



D[5HT3 MOUSE5HT3] 0.00000 



C[mAChR4 RATACHRMD] 0.00000 



A[nestin RATNESTIN] 
A[MK2 MUSMK] 

A[ODC RATODC] 
C[NGF RNNGFB] 
A[MK2 MUSMK] 
D[Ins1 RNINS1]
A[H2AZ RATHIS2AZ]
F[actin RNAC01]
A[DD63.2 (I I)] 

A[GAP43 RATGAP43] 
B[nAChRa4 RATNARAA] 
B[FGFR RATFGFR1] 

B[TH RATTOHA] 
B[Brm (I I)]

D[mGluR1 RATGPCGR]
A[EGFR RATEGFR]

F[NFL RATNFL] 
D[nAChRa2 RATNNAR] 
A[SC2 RNU19136] 

D[MOG RATMOG] 
D[mGluR4 RATMGLUR4B] 
A[IGFR2 MMU04710] 

A[GAP43 RATGAP43] 
B[nAChRa5 RATNACHRR] 
B[cjun RNRJG9] 

A[cellubrevin s63830] 
A[CRAF RATRAFA] 

B[keratin RNKER19] 
B[CNTF RNCNTF] 



B[TH RATTOHA] 
B[IGF II RATGFI2] 

D[nAChRd RNZCRD1] 
D[trk RATTRKPREC] 
A[PTN RATHBGAM] 
B[IGF II RATGFI2] 
B[Brm (I I)] 

A[CCO1 RATMTCYTOC]



F[NFM RATNFM] 
B[nAChRa5 RATNACHRR] 
B[cjun RNRJG9] 

A[MK2 MUSMK] 



D[mGluR4 RATMGLUR4B] 
A[IGFR1 RATIGFI]

D[mGluR4 RATMGLUR4B] 
D[5HT3 MOUSE5HT3] 



B[GRa1 (#)]

D[nAChRa2 RATNNAR] 
C[IP3R3 RATIP3R3X] 

F[NFM RATNFM] 
B[FGFR RATFGFR1] 
A[CCO2 RATMTCYTOC]

A[GRb1 RATGARB1]
B[IP3R1 RATI145TR]

A[cellubrevin s63830]
A[IGF I RATIGFIA] 



C[NGF RNNGFB] 0.00000 
B[Brm (I I)] 

D[nAChRe RNACRE] 0.00000

A[CNTFR S54212]

B[TGFR RATTGFBIIR]

A[cyclin A RATPCNA] 

A[TCP (I I)] 

A[SC1 RNU19135] 



C[mGluR2 RATMGLURB] 0.00000
B[trkC RATTRKCN3]
A[CCO2 RATMTCYTOC]



B[IGF II RATGFI2] 



0.00000 



D[nAChRa2 RATNNAR] 0.00000 
A[IGFR2 MMU04710] 

D[mGluR6 RATMGLUR6.] 0.00000 
A[IGFR1 RATIGFI] 



D[mGluR1 RATGPCGR] 0.00000
A[EGFR RATEGFR]



B[nAChRa4 RATNARAA] 0.00000 
A[IGF I RATIGFIA]



A[IGF I RATIGFIA] 0.00000 



B[TH RATTOHA] 0.00000 
A[InsR RATINSAB]
CLAIMS

1. A coincidence detection method for use with a data set having a number of attributes, the
method comprising the steps of:

• representing a set of M objects in terms of a number N_A of variables
("attributes"), where an attribute is said to occur in an object if the object
possesses the attribute;

sampling a subset of r_i out of the M objects, for each iteration among a
predetermined number of iterations;

detecting and recording coincidences among sets of k of the attributes in each
sampled subset of objects, a coincidence being the co-occurrence of 1 ≤ k <
N_A attributes in the same h_i out of r_i objects in the sampled subset, where 0 <

• determining an expected count of coincidences for any set of k attributes and
a predetermined number of iterations of sampling and coincidence-counting
as described above, the determining being performed before sampling and
collecting, at the same time or after sampling and collecting;

comparing, for any set of k attributes and number of iterations of sampling
and coincidence-counting, the observed count versus the expected count of
coincidences, and from this comparison determining a measure of correlation
(or association, or dependence) for the set of k attributes; and

reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a set of k of the N_A attributes which have been
determined by this process to have a value for a chosen correlation measure
above a predetermined threshold value.
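
The "expected count" and "measure of correlation" steps of claim 1 can be illustrated with a short Python sketch. It rests on assumptions that are not stated in the claim: attributes are treated as statistically independent, objects are drawn with replacement, and a coincidence is counted only when all r sampled objects possess all k attributes (the case h = r). The function names, the parameters, and the ratio used as a correlation measure are hypothetical.

```python
from math import prod


def expected_coincidence_count(marginals, r, iterations):
    """Expected number of iterations in which all r sampled objects share all k attributes.

    marginals  -- assumed input: fraction of the M objects possessing each of the k attributes
    r          -- number of objects sampled per iteration
    iterations -- number of sampling iterations
    Assumes independent attributes and sampling with replacement.
    """
    p_object = prod(marginals)      # P(one sampled object has all k attributes)
    p_coincidence = p_object ** r   # P(all r sampled objects have them)
    return iterations * p_coincidence


def correlation_ratio(observed, expected):
    """One simple (assumed) correlation measure: observed count over expected count."""
    return observed / expected if expected > 0 else float("inf")
```

A k-tuple would then be reported when `correlation_ratio` exceeds the chosen threshold; any other measure that compares the observed and expected counts could be substituted.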

2. A coincidence detection method for use with a data set of objects having a number of
attributes, the method comprising the steps of: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 
detecting, and recording counts of, coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

3. The coincidence detection method of claim 2, wherein the comparison of observed
and expected counts is calculated using a Chernoff bound on tail probabilities. 
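
Claim 3 names a Chernoff bound without giving its form in this section. The sketch below is one possible reading, not necessarily the form used elsewhere in the specification: it assumes the observed count can be treated as a sum of independent 0/1 indicators whose mean is the expected count, and applies the standard multiplicative bound P(X ≥ (1+δ)μ) ≤ (e^δ / (1+δ)^(1+δ))^μ. The function name is hypothetical.

```python
import math


def chernoff_upper_tail(observed, expected):
    """Upper bound on P(count >= observed) when the mean count is `expected`.

    Assumes the count is a sum of independent 0/1 indicator variables.
    Returns 1.0 (the trivial bound) when observed <= expected.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    if observed <= expected:
        return 1.0
    delta = observed / expected - 1.0
    # Work in log space: mu * (delta - (1 + delta) * ln(1 + delta))
    log_bound = expected * (delta - (1.0 + delta) * math.log1p(delta))
    return math.exp(log_bound)
```

A small bound indicates that an observed count this far above the expected count is unlikely under independence, so the bound (or its negative logarithm) can serve as the compared quantity.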

4. The coincidence detection method of claim 2, wherein the counts are recorded by
storing a running total of the count of each coincidence over all of the sampled 
subsets. 



5. A method for visual exploration of a data set of objects having a number of 
attributes, the method comprising the steps of: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

reporting a set of k-tuples of correlated attributes to a user through a 
graphical interface, where a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above a respective pre- 
determined threshold. 

6. A pre-processing method for use with a data modelling unit to capture and report to 
the data modelling unit higher order interactions of a data set of objects having a 
number of attributes, the method comprising the steps of: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset, where the plurality of 
attribute values is the same for each occurrence, the detecting and recording 
counts of coincidences in each sampled subset being performed before, at the 
same time or after sampling, detecting and recording counts of coincidences 
in other subsets; 
determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

reporting to the data modelling unit a set of k-tuples of correlated attributes, 
where a k-tuple of correlated attributes is a plurality of attributes for which 
the measure of correlation is above a respective pre-determined threshold. 

7. A correlation elimination method for use with a data set of objects having a number 
of attributes, the method comprising the steps of: 

sampling a subset of the data set for a predetermined number of iterations, 
each iteration the sampled subset of the data set having for each object the 
same subset of attributes; 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets;

• determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording; 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

eliminating a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

8. The method of claim 2, wherein the objects are sales transactions, each transaction 
comprising one or more purchased products, and the attributes are instances of sale 
of particular products or types of products. 

9. The method of claim 2, wherein the objects are time slices and the attributes are the 
status of elements in a system. 

10. The method of claim 2, wherein the objects are time slices and the attributes are
prices, or price changes of, financial instruments or commodities. 

11. The method of claim 2, wherein the steps of the method are represented by the
following pseudo-code:

0.  begin
1.      read(MATRIX);
2.      read(R, T);
3.      compute_first_order_marginals(MATRIX);
4.      csets := {};
5.      for iter = 1 to T do
6.          sampled_rows := rsample(R, MATRIX);
7.          attributes := get_attributes(sampled_rows);
8.          all_coincidences := find_all_coincidences(attributes);
9.          for coincidence in all_coincidences do
10.             if cset_already_exists(coincidence, csets)
11.                 then update_cset(coincidence, csets);
12.                 else add_new_cset(coincidence, csets);
13.             endif
14.         endfor
15.     endfor
16.     for cset in csets do
17.         expected := compute_expected_match_count(cset);
18.         observed := get_observed_match_count(cset);
19.         stats := update_stats(cset, hypoth_test(expected, observed));
20.     endfor
21.     print_final_stats(csets, stats);
22. end
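
The pseudo-code can be rendered as a compact Python sketch. Several choices here are assumptions made for illustration rather than details taken from the claim: `rsample` is taken to draw R rows with replacement, a coincidence is taken to be a set of (column, value) pairs on which all sampled rows agree, the enumeration is capped at a hypothetical `max_k` attributes per coincidence to keep it tractable, and `hypoth_test` is replaced by simply returning the observed and expected counts side by side.

```python
import random
from collections import Counter
from itertools import combinations


def detect_coincidences(matrix, r, t, max_k=2, seed=0):
    """Return {cset: (observed, expected)} after t iterations of r-row sampling."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Step 3: first-order marginals, i.e. value frequencies per column.
    marginals = [Counter(row[j] for row in matrix) for j in range(n_cols)]
    csets = Counter()  # Steps 4 and 10-12: running total per coincidence set
    for _ in range(t):  # Step 5
        sampled_rows = [rng.choice(matrix) for _ in range(r)]  # Step 6
        first = sampled_rows[0]
        # Step 7: (column, value) pairs on which every sampled row agrees.
        attributes = [(j, first[j]) for j in range(n_cols)
                      if all(row[j] == first[j] for row in sampled_rows)]
        # Steps 8-14: count every subset of 2..max_k agreeing attributes.
        for k in range(2, max_k + 1):
            for coincidence in combinations(attributes, k):
                csets[coincidence] += 1
    stats = {}
    for cset, observed in csets.items():  # Steps 16-20
        p_match = 1.0
        for j, value in cset:
            # Under independence, all r rows show `value` in column j with p**r.
            p_match *= (marginals[j][value] / n_rows) ** r
        stats[cset] = (observed, t * p_match)
    return stats
```

For a toy matrix whose two columns always vary together, the observed count of each joint value pair comes out well above its expected count, which is the signal the method looks for.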

12. A coincidence detection system for use with a data set of objects, each object having 
a plurality of attributes, the system comprising: 

• means for sampling a subset of the data set for a predetermined number of 
iterations, each iteration the sampled subset of the data set having for each 
object the same subset of attributes; 

means for detecting, and recording counts of, coincidences in each sampled 
subset of the data set, a coincidence being the co-occurrence of a plurality of 
attribute values in one or more objects in a sampled subset of the data set, 
where the plurality of attribute values is the same for each occurrence, the 
detecting and recording counts of coincidences in each sampled subset being 
performed before, at the same time or after sampling, detecting and recording 
counts of coincidences in other subsets; 

means for determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording; 

means for comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence; and 

means for reporting a set of k-tuples of correlated attributes, where a k-tuple 
of correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 
13. The coincidence detection system of claim 12, wherein the means of the system in
the aggregate carry out a method represented by the following pseudo-code:

0.  begin
1.      read(MATRIX);
2.      read(R, T);
3.      compute_first_order_marginals(MATRIX);
4.      csets := {};
5.      for iter = 1 to T do
6.          sampled_rows := rsample(R, MATRIX);
7.          attributes := get_attributes(sampled_rows);
8.          all_coincidences := find_all_coincidences(attributes);
9.          for coincidence in all_coincidences do
10.             if cset_already_exists(coincidence, csets)
11.                 then update_cset(coincidence, csets);
12.                 else add_new_cset(coincidence, csets);
13.             endif
14.         endfor
15.     endfor
16.     for cset in csets do
17.         expected := compute_expected_match_count(cset);
18.         observed := get_observed_match_count(cset);
19.         stats := update_stats(cset, hypoth_test(expected, observed));
20.     endfor
21.     print_final_stats(csets, stats);
22. end



14. The coincidence detection system of claim 12, wherein the means for sampling a 
subset of the data set comprises means for dividing the data set into subsets for 
sampling. 
15. The coincidence detection system of claim 14, wherein the means for detecting and
recording counts of coincidences comprises an array of processing nodes, each 
processing node detecting and recording a respective subcount of coincidences, and 
wherein the means for comparing, for each coincidence of interest, said observed 
count of coincidences to said expected count of coincidences comprises means for 
merging said subcounts to provide said observed count. 

16. The coincidence detection system of claim 15, wherein at least one of said processing
nodes comprises a respective subarray of processing nodes that detect and record 
respective subsubcounts of coincidences, and wherein said means for merging 
merges said subsubcounts to provide said subcounts and/or said observed count. 
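
The merging of subcounts and subsubcounts in claims 15 and 16 amounts to summing per-coincidence tallies up a tree of processing nodes. A minimal Python sketch follows, assuming each leaf node reports its tallies as a `collections.Counter` and each subarray is represented as a list of child nodes; both representations and the function name are hypothetical.

```python
from collections import Counter


def merge_counts(node):
    """Merge tallies from a processing node into one observed-count table.

    node -- either a Counter (a leaf node's subcount or subsubcount)
            or a list of nodes (a subarray of processing nodes)
    """
    if isinstance(node, Counter):
        return Counter(node)  # copy so the leaf's own tally is not mutated
    total = Counter()
    for child in node:
        total.update(merge_counts(child))  # Counter.update adds counts
    return total
```

Because addition is associative and commutative, the nodes can report in any order and at any depth and still produce the same observed count.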

17. The coincidence detection system of claim 15 or 16, wherein each processing node
comprises memory including an input buffer for storing received subsets of the data 
set and an output buffer for storing the subcount or the subsubcount; and a memory 
bus that transfers data to and from the memory. 

18. Coincidence detection programmed media for use with a computer and with a data
set of objects having a number of attributes represented in a matrix of objects versus
attributes, the programmed media comprising:

a computer program stored on storage media compatible with the computer,
the computer program containing instructions to direct the computer to:

sample a subset of the data set for a predetermined number of
iterations, each iteration the sampled subset of the data set having for
each object the same subset of attributes;

detect and record counts of coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of
attribute values in one or more objects in a sampled subset of the data
set, where the plurality of attribute values is the same for each
occurrence, the detecting and recording counts of coincidences in
each sampled subset being performed before, at the same time or after
sampling, detecting and recording counts of coincidences in other
subsets;

determine an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after
sampling, detecting and recording;

compare, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determine a measure of correlation for the plurality of
attributes for the coincidence; and

• report a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure
of correlation is above a respective pre-determined threshold.

19. Coincidence detection system for use with a data set of objects having a number of
attributes, the system comprising:

a computer; and

a computer program on media compatible with the computer, the computer program
directing the computer to:

sample a subset of the data set for a predetermined number of iterations, each
iteration the sampled subset having for each object the same subset of
attributes,

• detect, and record counts of, coincidences in each sampled subset of the data
set, a coincidence being the co-occurrence of a plurality of attribute values in
one or more objects in a sampled subset of the data set, where the plurality of
attribute values is the same for each occurrence, the detecting and recording
counts of coincidences in each sampled subset being performed before, at the
same time or after sampling, detecting and recording counts of coincidences
in other subsets;

determine an expected count for each coincidence of interest, the determining
being performed before, at the same time, or after sampling, detecting and
recording,

compare, for each coincidence of interest, the observed count of coincidences
versus the expected count of coincidences, and from this comparison
determine a measure of correlation for the plurality of attributes for the
coincidence, and

• report a set of k-tuples of correlated attributes, where a k-tuple of correlated
attributes is a plurality of attributes for which the measure of correlation is
above a respective pre-determined threshold.

20. The coincidence method of claim 2, further comprising the step of representing the
objects and attributes in a matrix of objects versus attributes prior to sampling the
data set, the data set being sampled by sampling the matrix.
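
As an illustration of the matrix representation in claim 20, the Python sketch below turns a list of objects (for example the sales transactions of claim 8, each given as a set of purchased products) into a binary matrix of objects versus attributes and then samples rows of that matrix. The helper names and the choice of sampling without replacement are assumptions made for the example.

```python
import random


def build_matrix(objects):
    """Return (attributes, matrix) where matrix[i][j] is 1 if object i has attribute j."""
    attributes = sorted({a for obj in objects for a in obj})
    matrix = [[1 if a in obj else 0 for a in attributes] for obj in objects]
    return attributes, matrix


def sample_rows(matrix, r, rng):
    """Sample r distinct rows (objects) of the matrix; rng is a random.Random instance."""
    return rng.sample(matrix, r)
```

Every sampled row carries the same columns, which matches the requirement that each sampled subset have, for each object, the same subset of attributes.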

21. A product having a set of attributes selected by: 
sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset having 
for each object the same subset of attributes, 

detecting, and recording counts of, coincidences in each sampled subset of 
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets, 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording, 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and

reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

22. A product defined by applying a set of rules generated from: 

sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset having 
for each object the same subset of attributes, 

detecting and recording counts of coincidences in each sampled subset of the 
data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset of the data set, where the 
plurality of attribute values is the same for each occurrence, the detecting and 
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets, 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording, 

comparing, for each coincidence of interest, the observed count of 
coincidences versus the expected count of coincidences, and from this 
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and 
• reporting a set of k-tuples of correlated attributes, where a k-tuple of 
correlated attributes is a plurality of attributes for which the measure of 
correlation is above a respective pre-determined threshold. 

23. A method comprising:

the method of claim 2, and 
the further step of: 

applying rules that are defined by the reported correlated attributes.

24. A peptide or peptidomimetic including a structural motif of the V3 loop of HIV
envelope protein including spatial coordinates of residues A18/Q31/H33. 

25. A pharmaceutical composition comprising a ligand that interacts with a protein 
having a structural motif identified using the method of claim 2, and a 
pharmaceutically acceptable carrier or excipient therefor.

26. The pharmaceutical composition of claim 25, wherein the ligand comprises chemical 
moieties of suitable identity and spatially located relative to each other so that the 
moieties interact with corresponding residues or portions of the motif. 

27. The pharmaceutical composition of claim 26, wherein the ligand, by interacting with 
the motif, interferes with function of a region of the protein comprising the motif.

28. A diagnostic agent comprising a ligand that interacts with a protein having a
structural motif identified using the method of claim 2, and a detectable label linked 
to the ligand. 

29. A pharmaceutical composition for interacting with an envelope protein of human 
immunodeficiency virus (HIV), the envelope protein including a structural motif of 
the V3 loop having spatial coordinates of residues A18/Q31/H33, comprising a ligand 
including at least one functional group that interacts with the motif, and a 
pharmaceutically acceptable carrier or excipient therefor.

30. The pharmaceutical composition of claim 29, wherein the ligand includes at least one 
functional group capable of binding to and being present in an effective position in 
said ligand to bind to residue 18, at least one functional group capable of binding to 
and being present in an effective position in said ligand to bind to residue 31, and at
least one functional group capable of binding to and being present in an effective 
position in said ligand to bind to residue 33.

31. A method of designing a ligand to interact with a structural motif of an envelope
protein of human immunodeficiency virus (HIV), the method comprising the steps
of: providing a template having spatial coordinates of residues A18, Q31 and H33 in
the V3 loop of HIV envelope protein, and computationally evolving a chemical
ligand using an effective algorithm with spatial constraints, so that said evolved 
ligand includes at least one effective functional group that binds to the motif. 

32. The method of claim 31, wherein the ligand comprises: at least one functional group 
capable of binding to and being present in an effective position in said ligand to bind 
to residue 18, at least one functional group capable of binding to and being present in 
an effective position in said ligand to bind to residue 31, and at least one functional 
group capable of binding to and being present in an effective position in said ligand 
to bind to residue 33. 

33. A method of identifying a ligand to bind with a structural motif of an envelope 
protein of human immunodeficiency virus (HIV), the method comprising the steps
of: providing a template having spatial coordinates of A18, Q31 and H33 in the V3 
loop of HIV envelope protein; providing a data base containing structure and 
orientation of molecules; and screening said molecules to determine if they contain 
effective moieties spaced relative to each other so that the moieties interact with the 
motif.

34. The method of claim 33, wherein a first moiety of the molecule interacts with residue 
18, a second moiety of the molecule interacts with residue 31 and a third moiety of
the molecule interacts with residue 33. 

35. Antigens and vaccines embodying the covarying k-tuples described herein. 

36. A product being defined by its interaction with a set of attributes selected by: 

sampling a subset of a data set representing objects versus attributes for a 
predetermined number of iterations, each iteration the sampled subset of the
data set having for each object the same subset of attributes,

detecting, and recording counts of, coincidences in each sampled subset of
the data set, a coincidence being the co-occurrence of a plurality of attribute 
values in one or more objects in a sampled subset, where the plurality of 
attribute values is the same for each occurrence, the detecting and recording
counts of coincidences in each sampled subset being performed before, at the
same time or after sampling, detecting and recording counts of coincidences
in other subsets, 

determining an expected count for each coincidence of interest, the 
determining being performed before, at the same time, or after sampling, 
detecting and recording,

• comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of 
attributes for the coincidence, and 
• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of
correlation is above a pre-determined threshold. 

37. The method of claim 2, wherein the objects are compounds and the attributes
comprise particular chemical moieties. 

38. The method of claim 2, wherein the objects are peptides or proteins and the
attributes comprise particular structural or substructural patterns or motifs. 

39. The method of claim 2, wherein the objects are selected from the group consisting of
compounds, molecular structures, nucleotide sequences and amino acid sequences
and the attributes are features of the selected objects. 

40. The method of claim 2, wherein the objects are time slices and the attributes are
biological parameters of genes or gene products.

41. The method of claim 2, wherein the objects are documents that are electronically
stored and/or electronically indexed and the attributes are topics. 

42. The method of claim 2, wherein the objects are customers and the attributes 
comprise products purchased or not purchased by those customers. 

43. The method of claim 42, wherein the attributes further comprise mailings made or 
not made to the customers. 

44. The method of claim 2, wherein the objects comprise products and the attributes 
comprise customers that have or have not purchased those products. 

45. The method of claim 44, wherein the attributes further comprise demographic 
variables of the customers. 

46. The method of claim 2, wherein the objects are people with a particular disease or 
disorder and the attributes are potential contributing factors for the disease or
disorder. 

47. The method of claim 2, wherein the objects are people with a number of different 
diseases or disorders and the attributes are potential contributing factors for the 
diseases or disorders. 

48. The method of claim 2, wherein the objects comprise factors potentially contributing 
to a disease or disorder and the attributes are people with or without those factors, 
wherein the method associates groups of people of substantially equivalent risk for 
the disease or disorder. 

49. The method of claim 2, wherein the objects are time slices and the attributes 
comprise the state of components in a system at time slices prior to failure of the 
system, wherein the method associates component states that may potentially cause 
failure of the system.

50. The coincidence detection method of claim 1, where r_t is the same for every iteration.

51. The method of claim 2, further comprising the steps of first creating a database of
transitions between system states, wherein a system state is represented by a value of
a state variable, over a chosen time quantum, and presenting the database, in whole
or part, as a data set such that each state to state transition set corresponds to one of
M objects and so that each state variable corresponds to an attribute.

52. The method of claim 2, further comprising the steps of first creating a database of 
states and actions covering a chosen time quantum and presenting the database, in 
whole or part, as a data set such that each state/action/state triple corresponds to one
of M objects and so that each state variable or action type corresponds to an
attribute.
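
By way of illustration only, the presentation of state/action/state triples as a data set, as recited in claim 52, might be sketched as follows. The function name `transitions_to_dataset`, the `before:` and `after:` attribute prefixes, and the example state variables are assumptions made for this sketch and do not appear in the specification.

```python
def transitions_to_dataset(states, actions):
    """Turn a recorded trajectory into one object per state/action/state triple.

    states  -- list of dicts mapping state variable names to values
    actions -- actions[t] is the action taken between states[t] and states[t + 1]
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected exactly one more state than actions")
    dataset = []
    for t, action in enumerate(actions):
        # Hypothetical naming: prefix each state variable with its position in the triple.
        obj = {f"before:{name}": value for name, value in states[t].items()}
        obj["action"] = action
        obj.update({f"after:{name}": value for name, value in states[t + 1].items()})
        dataset.append(obj)
    return dataset

# Illustrative trajectory with two state variables and two actions (M = 2 objects).
states = [
    {"valve": "open", "temp": "low"},
    {"valve": "closed", "temp": "low"},
    {"valve": "closed", "temp": "high"},
]
actions = ["close_valve", "heat"]
dataset = transitions_to_dataset(states, actions)
```

Each resulting dict is one object, and each key is one attribute, so the output can be arranged as a matrix of objects versus attributes before sampling.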

53. A coincidence detection method for use with a data set of objects having a number of 
attributes represented in a matrix of objects versus attributes, the method comprising 
the steps of: 

• sampling a subset of the matrix for a predetermined number of iterations,
each iteration the sampled subset of the matrix having for each object the
same subset of attributes; 
• detecting, and recording counts of, coincidences in each sampled subset of
the matrix, a coincidence being the co-occurrence of a plurality of attribute
values in one or more objects in a sampled subset of the matrix, where the
plurality of attribute values is the same for each occurrence, the detecting and
recording counts of coincidences in each sampled subset being performed 
before, at the same time or after sampling, detecting and recording counts of 
coincidences in other subsets; 

• determining an expected count for each coincidence of interest, the
determining being performed before, at the same time, or after sampling,
detecting and recording; 

• comparing, for each coincidence of interest, the observed count of
coincidences versus the expected count of coincidences, and from this
comparison determining a measure of correlation for the plurality of
attributes for the coincidence; and 

• reporting a set of k-tuples of correlated attributes, where a k-tuple of
correlated attributes is a plurality of attributes for which the measure of
correlation is above a respective pre-determined threshold.

54. The method of claim 1, wherein numerical correlation values are reported along with 
the set of k-tuples of correlated attributes. 
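
For illustration only, the steps of claim 53, together with the reporting of numerical values in claim 54, might be sketched as follows for pairs of attributes (k = 2). The function name `detect_coincidences`, the independence-based expected count, and the observed/expected ratio used as the measure of correlation are assumptions of this sketch; the claims do not prescribe a particular null model or measure.

```python
import random
from collections import Counter
from itertools import combinations

def detect_coincidences(matrix, r, iterations, threshold, seed=0):
    """Return {((col_i, val_i), (col_j, val_j)): observed / expected} above threshold."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])

    # Marginal counts of each attribute value over the full data set.
    marginals = [Counter(row[c] for row in matrix) for c in range(n_cols)]

    # Sample r objects per iteration and record co-occurring value pairs.
    observed = Counter()
    for _ in range(iterations):
        for row in rng.sample(matrix, r):
            for i, j in combinations(range(n_cols), 2):
                observed[((i, row[i]), (j, row[j]))] += 1

    report = {}
    for key, obs in observed.items():
        (i, vi), (j, vj) = key
        # Assumed null model: the two attribute values occur independently.
        p_i = marginals[i][vi] / n_rows
        p_j = marginals[j][vj] / n_rows
        expected = iterations * r * p_i * p_j
        ratio = obs / expected  # assumed measure of correlation
        if ratio > threshold:
            report[key] = ratio
    return report

# Illustrative data: columns 0 and 1 always vary together; column 2 does not.
data = [
    ("A", "Q", "x"), ("A", "Q", "y"), ("A", "Q", "x"), ("A", "Q", "y"),
    ("G", "R", "x"), ("G", "R", "y"), ("G", "R", "x"), ("G", "R", "y"),
]
report = detect_coincidences(data, r=4, iterations=200, threshold=1.5)
for pair, value in sorted(report.items()):
    print(pair, round(value, 2))
```

The returned mapping pairs each reported k-tuple with its numerical value, so the same structure serves both the thresholded report and the accompanying correlation values.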